Mittwoch, 8. Februar 2017

Abstract Neural Machines at ICLR 2017

With my recent excitement for abstract neural machines (see e.g. the related NAMPI workshop), I skimmed the long list of submissions to the ICLR conference for interesting papers.
(I might have missed one or the other paper [please let me know], but this is what I found.)

I will post a follow-up once I have worked my way through all the papers.

Accepted Papers

Making Neural Programming Architectures Generalize via Recursion
Abstract: Empirically, neural networks that attempt to learn programs from data have exhibited poor generalizability. Moreover, it has traditionally been difficult to reason about the behavior of these models beyond a certain level of input complexity. In order to address these issues, we propose augmenting neural architectures with a key abstraction: recursion. As an application, we implement recursion in the Neural Programmer-Interpreter framework on four tasks: grade-school addition, bubble sort, topological sort, and quicksort. We demonstrate superior generalizability and interpretability with small amounts of training data. Recursion divides the problem into smaller pieces and drastically reduces the domain of each neural network component, making it tractable to prove guarantees about the overall system’s behavior. Our experience suggests that in order for neural architectures to robustly learn program semantics, it is necessary to incorporate a concept like recursion.
Lie-Access Neural Turing Machines
Abstract:  Recent work has demonstrated the effectiveness of employing explicit external memory structures in conjunction with deep neural models for algorithmic learning (Graves et al. 2014; Weston et al. 2014). These models utilize differentiable versions of traditional discrete memory-access structures (random access, stacks, tapes) to provide the variable-length storage necessary for computational tasks. In this work, we propose an alternative model, Lie-access memory, that is explicitly designed for the neural setting. In this paradigm, memory is accessed using a continuous head in a key-space manifold. The head is moved via Lie group actions, such as shifts or rotations, generated by a controller, and soft memory access is performed by considering the distance to keys associated with each memory. We argue that Lie groups provide a natural generalization of discrete memory structures, such as Turing machines, as they provide inverse and identity operators while maintain differentiability. To experiment with this approach, we implement several simplified Lie-access neural Turing machine (LANTM) with different Lie groups. We find that this approach is able to perform well on several algorithmic experiments, and outperforms RNN-based methods.
Program Synthesis for Character Level Language Modeling
Abstract: We propose a statistical model applicable to character level language modeling and show that it is a good fit for both, program source code and English text. The model is parameterized by a program from a domain-specific language (DSL) that allows expressing non-trivial data dependencies. Learning is done in two phases: (i) we synthesize a program from the DSL, essentially learning a good representation for the data, and (ii) we learn parameters from the training data - the process is done via counting, as in simple language models such as n-gram. Our experiments show that the precision of our model is comparable to that of neural networks while sharing a number of advantages with n-gram models such as fast query time and the capability to quickly add and remove training data samples. Further, the model is parameterized by a program that can be manually inspected, understood and updated, addressing a major problem of neural networks.

From a software-engineering point of view, I feel that this approach is very promising, as it offers somewhat interpretable output in the form of a domain-specific language.

Neuro-Symbolic Program Synthesis
Abstract: Recent years have seen the proposal of a number of neural architectures for the problem of Program Induction. Given a set of input-output examples, these architectures are able to learn mappings that generalize to new test inputs. While achieving impressive results, these approaches have a number of important limitations: (a) they are computationally expensive and hard to train, (b) a model has to be trained for each task (program) separately, and (c) it is hard to interpret or verify the correctness of the learnt mapping (as it is defined by a neural network). In this paper, we propose a novel technique, Neuro-Symbolic Program Synthesis, to overcome the above-mentioned problems. Once trained, our approach can automatically construct computer programs in a domain-specific language that are consistent with a set of input-output examples provided at test time. Our method is based on two novel neural modules. The first module, called the cross correlation I/O network, given a set of input-output examples, produces a continuous representation of the set of I/O examples. The second module, the Recursive-Reverse-Recursive Neural Network (R3NN), given the continuous representation of the examples, synthesizes a program by incrementally expanding partial programs. We demonstrate the effectiveness of our approach by applying it to the rich and complex domain of regular expression based string transformations. Experiments show that the R3NN model is not only able to construct programs from new input-output examples, but it is also able to construct new programs for tasks that it had never observed before during training.

Neural Program Lattices
Abstract: We propose the Neural Program Lattice (NPL), a neural network that learns a hierarchical program structure from a mixture of strong supervision and weak supervision. Our starting point is the recent work of Neural Programmer-Interpreters (NPI), which can only learn from strong supervision (full program execution traces). NPLs can additionally learn from weak supervision consisting of flat sequences of elementary operations. We demonstrate the capability of NPL to learn to perform long-hand addition and arrange blocks in a grid-world environment. Experiments show that it performs on par with NPI while using weak supervision in place of most of the strong supervision, thus indicating its ability to impute latent program abstraction structure from examples containing only weak supervision.

Rejected Papers

Getting rejected from a conference is by no means a sign that a paper lacks good ideas; it might just not fit the reviewers' taste, or the advances it proposes might be more a matter of engineering than of concepts. The NIPS review experiment also revealed a large amount of randomness in the decision-making process.
Thanks to the open review process, everyone can see what was criticized and what led to the rejection.

Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes
Abstract: In this paper, we extend neural Turing machine (NTM) into a dynamic neural Turing machine (D-NTM) by introducing a trainable memory addressing scheme. This addressing scheme maintains for each memory cell two separate vectors, content and address vectors. This allows the D-NTM to learn a wide variety of location-based addressing strategies including both linear and nonlinear ones. We implement the D-NTM with both continuous, differentiable and discrete, non-differentiable read/write mechanisms. We investigate the mechanisms and effects for learning to read and write to a memory through experiments on Facebook bAbI tasks using both a feedforward and GRU-controller. The D-NTM is evaluated on a set of Facebook bAbI tasks and shown to outperform NTM and LSTM baselines. We also provide further experimental results on sequential MNIST, associative recall and copy tasks.

Hierarchical Memory Networks
Abstract: Memory networks are neural networks with an explicit memory component that can be both read and written to by the network. The memory is often addressed in a soft way using a softmax function, making end-to-end training with backpropagation possible. However, this is not computationally scalable for applications which require the network to read from extremely large memories. On the other hand, it is well known that hard attention mechanisms based on reinforcement learning are challenging to train successfully. In this paper, we explore a form of hierarchical memory network, which can be considered as a hybrid between hard and soft attention memory networks. The memory is organized in a hierarchical structure such that reading from it is done with less computation than soft attention over a flat memory, while also being easier to train than hard attention over a flat memory. Specifically, we propose to incorporate Maximum Inner Product Search (MIPS) in the training and inference procedures for our hierarchical memory network. We explore the use of various state-of-the art approximate MIPS techniques and report results on SimpleQuestions, a challenging large scale factoid question answering task.

Not about Neural Abstract Machines, but related Papers

This is not a typical neural abstract machine, but it follows the idea of using differentiable elements to make things trainable that otherwise wouldn't be:

Neural Functional Programming
Abstract: We discuss a range of modeling choices that arise when constructing an end-to-end differentiable programming language suitable for learning programs from input-output examples. Taking cues from programming languages research, we study the effect of memory allocation schemes, immutable data, type systems, and built-in control-flow structures on the success rate of learning algorithms. We build a range of models leading up to a simple differentiable functional programming language. Our empirical evaluation shows that this language allows to learn far more programs than existing baselines.

This is slightly different, but since I had quite some exposure to graph transformation and related formalisms in my undergrad, I find this fascinating, too:
Learning Graphical State Transitions
Abstract: Graph-structured data is important in modeling relationships between multiple entities, and can be used to represent states of the world as well as many data structures. Li et al. (2016) describe a model known as a Gated Graph Sequence Neural Network (GGS-NN) that produces sequences from graph-structured input. In this work I introduce the Gated Graph Transformer Neural Network (GGT-NN), an extension of GGS-NNs that uses graph-structured data as an intermediate representation. The model can learn to construct and modify graphs in sophisticated ways based on textual input, and also to use the graphs to produce a variety of outputs. For example, the model successfully learns to solve almost all of the bAbI tasks (Weston et al., 2016), and also discovers the rules governing graphical formulations of a simple cellular automaton and a family of Turing machines.


In linguistic applications, typical tasks are translating a sentence or deciding whether a given string belongs to a specific language. In the past, popular models to learn such rules were finite state machines, pushdown automata, and hidden Markov models. We understand these models fairly well, and they each describe a class in the Chomsky hierarchy. This makes them very apt for modeling formal systems. But when it comes to describing natural language and solving problems in NLP, the rules imposed by formal grammars are often too strict and limited to model human writing and speech.

In recent years, neural models, especially deep networks, have outperformed automata models on real-world tasks. Deep networks perform particularly well on large datasets. Interestingly, recent developments in parts of the deep learning community have taken renewed inspiration from the field of automata and formal models to improve RNN- and LSTM-based deep networks for sequence prediction and transduction tasks. This isn't the first time the two fields have met (see e.g. Giles et al.'s work on neural stacks from the early '90s), but it is the first time deep networks are used in practice at large scale, offering the best performance.

The key idea behind all of these proposals is to extend neural networks with memory managed by a controller. The controller, which manages access to and use of the memory, is built to be a differentiable operator (e.g. another kind of network with differentiable access operations). The resulting network can be trained using standard optimization algorithms and frameworks, benefiting from the same GPU acceleration as other networks.
To my limited knowledge, the surge of interest in these models came with Graves et al.'s neural Turing machines (NTM) at DeepMind and Weston et al.'s memory networks at Facebook, proposed at roughly the same time. Both approaches extend neural networks with a read-write memory block. While the NTM paper focuses on program inference and solving algorithmic tasks, the memory network paper focuses on increasing performance on language problems. Since other blogs already offer nice high-level summaries of NTMs and memory networks, I will not go into more detail. Moreover, at this year's NAMPI workshop at NIPS, Graves extended the idea of memory access by additionally learning how many computation steps are required to finish the computation and output a decision.

The paper I am focusing on is Learning to Transduce with Unbounded Memory by Grefenstette et al. The paper’s goal is to provide a middle ground between the fully random access memory of NTM and the static memory of RNNs. The abstract says:
Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. These experiments lead us to propose new memory-based recurrent networks that implement continuously differentiable analogues of traditional data structures such as Stacks, Queues, and DeQues. We show that these architectures exhibit superior generalisation performance to Deep RNNs and are often able to learn the underlying generating algorithms in our transduction experiments.
The key data structure implemented is a “continuous” stack. Its read and write operations are not discrete but lie on a continuum in (0,1), modeling the certainty of wanting to push onto or pop from the stack. The data objects are vectors. The stack is modeled by two components: a value matrix V and a strength vector s. The value matrix grows with each time step by appending a new row and models an append-only memory; the logical stack is extracted using the strength vector s. A controller acts on the tuple (V, s). It takes in a pop signal u, a push signal d, and a value v, and produces an (output) read vector r. The quantities u and d are used to update the strength vector s, whereas v is appended to the value matrix V, and the read vector r is a weighted sum of the rows of the value matrix V.
The following figure illustrates the initial push of v_1 onto the stack, a very “weak” push of v_2, and then a pop operation and another push of a value v_3 (the exact equations and rules to modify s and read r are stated in the paper).
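The stack update just described can be sketched in a few lines of NumPy. This is my own minimal re-implementation of the equations from the paper, not the authors' code; names follow the notation above (V, s, d, u, v, r):

```python
import numpy as np

def stack_step(V, s, d, u, v):
    """One step of the continuous stack: V is the (t-1, m) value matrix,
    s the (t-1,) strength vector, d/u the push/pop signals in (0, 1),
    and v the m-dimensional value to append."""
    V = np.vstack([V, v])            # append-only memory: v becomes row t
    t = V.shape[0]
    s_new = np.zeros(t)
    for i in range(t - 1):
        # the pop signal u is absorbed top-down by the strengths above cell i
        s_new[i] = max(0.0, s[i] - max(0.0, u - s[i + 1:].sum()))
    s_new[-1] = d                    # the new value enters with strength d
    # read vector: take (up to) a total weight of 1 from the top of the stack
    r = np.zeros(V.shape[1])
    for i in range(t):
        r += min(s_new[i], max(0.0, 1.0 - s_new[i + 1:].sum())) * V[i]
    return V, s_new, r
```

A full push (d = 1) followed by a full pop (u = 1) makes the read vector return the previous top of the stack, mirroring the figure.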

The next figure illustrates the setup: the memory at the center and the controller inputs d for pushing, u for popping, and the value v. Moreover, the previous value matrix and previous strength vector are used. The outputs are the next value matrix and strength vector as well as the read vector. This construction yields a differentiable memory block containing a stack, but there are no free parameters to optimize its behavior. By viewing the previous value matrix, strength vector, and read vector as the state of an RNN that receives an input vector i, the authors obtain a trainable system with free parameters.
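To illustrate the wiring, here is a hypothetical minimal controller: a one-layer network mapping the external input and the previous read vector to the push signal d, pop signal u, and value v. The weight shapes and nonlinearities are my own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m, h = 4, 8                               # value size and hidden size
Wx = rng.normal(size=(h, 2 * m))          # consumes input i and previous read r
Wd = rng.normal(size=(1, h))              # head for the push signal d
Wu = rng.normal(size=(1, h))              # head for the pop signal u
Wv = rng.normal(size=(m, h))              # head for the pushed value v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def controller(i, r_prev):
    """Map the external input and the previous read vector to stack signals."""
    hidden = np.tanh(Wx @ np.concatenate([i, r_prev]))
    d = float(sigmoid(Wd @ hidden)[0])    # push strength in (0, 1)
    u = float(sigmoid(Wu @ hidden)[0])    # pop strength in (0, 1)
    v = np.tanh(Wv @ hidden)              # value vector to push
    return d, u, v
```

Because every operation here is differentiable, gradients flow from the read vector back into these weights, which is exactly what makes the stack usage learnable.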

But what advantage does such a system offer? To determine its effectiveness, the authors consider several simple tasks (copying a sequence, reversing a sequence, and inverting bigrams in a sequence) and tasks from linguistics (using inversion transduction grammars, a subclass of context-free grammars). The network enhanced with a stack is compared to a deep LSTM network. Overall, the stack-enhanced network not only performs better but also converges faster.

Unfortunately, the authors don’t provide an analysis of the stack usage. I think it would be interesting to see how the LSTM controller learns how to use the stack and compare the results with traditional pushdown automata. In grammatical inference, the usual goal is to find the smallest possible automaton. How different is this goal from learning a stack-enhanced LSTM? Can we understand the model, and does it offer some insight? The ability to interpret automata (and their use as a specification language in formal systems) is a huge motivating factor for our own work (see e.g. our paper on interpreting automata for sequential data). What can we learn from others?
Disclaimer: The graphics are taken from the research paper; the Chomsky hierarchy figure is from Wikimedia Commons (by Dnu72).
This blog post was first published on

Mittwoch, 16. Dezember 2015

A blog post about blog posts about NIPS

After this year's NIPS conference, a lot of people wrote summaries, pointed out their personal take-aways or favorite sessions, or shared their session notes.

Here's a (totally incomplete, and randomly ordered) list of blog posts I noticed; please feel free to comment with a link to your own post so I can add yours to the list. I will continue to expand and complete it.

I haven't had time to read and work through all of them, but I will try to add short summaries and author information to this blog post as I go through them.

Mittwoch, 30. Juli 2014

Integrating Machine Learning Models for Real-Time Prediction into your Existing Workflow (using openscoring and PMML)

In today's world, understanding customers and learning from their behavior is a key component in a company's competitive edge in the market. This not only refers to lower user-retention costs in marketing through intelligently timed re-engagement and a higher lifetime value of users through clever item recommendation systems, but also extends to lower operating costs and a better user experience through modern risk and fraud management: discovering fraudulent participants in marketplaces or payments at risk are vital to the overall performance.

Especially for established companies outside the mobile- and web-service space, adopting new practices and integrating lessons learned from a thorough data analysis can be hard. Established workflows and running systems need to be changed, which can often be a painful experience---especially when outside consultants are hired to conduct the initial study and viability analysis. Their tools might not be a good fit for the company's established stack.

A good way to integrate a new layer of data mining and machine learning models is a middleware layer, such as openscoring. It runs independently and can be accessed through the REST API it provides. Existing software does not need to be extended with new libraries for data analysis; it only needs to be able to communicate via HTTP, passing on XML requests---a very low bar most systems will pass without any additions---and this communication can be implemented without dedicated machine learning specialists.

The machine learning models can be created offline, in a variety of languages such as Python or R. An XML description language, the Predictive Model Markup Language (PMML), is then used to deploy the model on a server in the cloud. There is even a Heroku-ready version that can be set up with a couple of lines of code in a matter of minutes.

In the following, I will outline how a model created using the statistical language R, e.g. by a group of consultants or an in-house team, is deployed as a service, ready to be integrated into your existing frameworks.

The example is very simple: a generalized linear model, i.e. fitting a function to given data points such that a given error function is minimized (the snippet below fits a logistic regression). You can just open RStudio or any other environment for R and try it out yourself.

The package for R is called pmml and can be installed using the command
> install.packages("pmml")
There is good documentation available online. Since the native output of the package is XML, make sure you have the XML library installed.

The following code snippet creates a regression model for a data file on my web server; please beware that it omits a number of important steps (like a proper treatment of missing data, or normalizing the data). But it suffices to give a rough idea: a model is created and fed to the pmml function, which in turn creates an XML description. We store the description in a file named glm-pmml.xml.

library(pmml)
library(XML)

rawDataDF <- read.csv("")
rawDataDF <- na.omit(rawDataDF)

target <- rawDataDF$TARGET_Adjusted

N <- length(target)
M <- N - 500

# hold out 500 rows as a test set
data.trainingIndex <- sample(N, M)
data.trainingSet <- rawDataDF[data.trainingIndex,]
data.testSet <- rawDataDF[-data.trainingIndex,]

# "TARGET_Adjusted ~ ." excludes the response from the predictors
glm.model <- glm(TARGET_Adjusted ~ ., data=data.trainingSet, family="binomial")
glm.pmml <- pmml(glm.model, model.name="GLM Model", data=data.trainingSet)

# write the PMML description to disk
xmlFile <- file.path(getwd(), "glm-pmml.xml")
saveXML(glm.pmml, file=xmlFile)

After creating the model and storing it in a PMML file, the next step is its deployment. There are two choices: 1) the model can be uploaded via the REST interface, or 2) it can be deployed with the command line client.
1) Using the command line tool curl, the request simply PUTs the file:

> curl -X PUT --data-binary @glm-pmml.xml -H "Content-type: text/xml" http://localhost:8080/openscoring/model/GLMTest

2) Via command line

> java -cp client-executable-1.1-SNAPSHOT.jar org.openscoring.client.Deployer --model http://localhost:8080/openscoring/model/GLMTest --file glm-pmml.xml

The setup of openscoring itself is straightforward and uncomplicated. Either clone the git repository on GitHub and deploy it directly to Heroku, or download and install it locally---the documentation of openscoring as well as Maven provide step-by-step instructions.

Using a running instance of openscoring with its model is simple: Just send requests via HTTP. For the sake of simplicity, we can just feed back the whole CSV file we used for training:

> curl -X POST --data-binary @simple_model.csv -H "Content-type: text/plain" http://localhost:8080/openscoring/model/GLMTest/csv

The answer will be a list of input-output values. Instead of using curl to send requests via the command line, you can easily integrate the API with your existing software projects, e.g. to receive a score evaluating the likelihood of fraudulent offers on your marketplace. A prominent user of openscoring is Airbnb: the company uses decision tree models deployed on openscoring servers to evaluate and catch fraudulent bookings in real time.
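As a sketch of such an integration, the following Python snippet builds and sends a single-record evaluation request using only the standard library. The JSON field names ("id", "arguments", "result") follow my reading of the openscoring REST documentation and may need adjusting for your version; base_url and model_id are placeholders:

```python
import json
import urllib.request

def build_request(arguments):
    """Build the JSON body for a single-record evaluation request."""
    # "id" and "arguments" are my assumption of the request schema;
    # double-check against the docs of your openscoring version.
    return json.dumps({"id": "record-001", "arguments": arguments})

def score(base_url, model_id, arguments):
    """POST one record to the model endpoint and return the result fields."""
    req = urllib.request.Request(
        url="%s/openscoring/model/%s" % (base_url, model_id),
        data=build_request(arguments).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]
```

Calling score("http://localhost:8080", "GLMTest", {...}) against a running instance would then return the model's prediction for that record.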

Are there any drawbacks to this approach? Yes, in some cases: since the machine learning models need to be supported by the PMML language, the newest ideas presented in research papers cannot be moved directly into production with openscoring and PMML. But for the vast majority of use cases, this does not matter much: while new models often have a slight edge in the specific application area presented in their papers, the transfer to a company's application will not automatically translate into the same performance advantage over traditional models. The amount of fine-tuning necessary to gain any advantage will outbalance the disadvantage of a slightly older machine learning model.

Stay tuned for my follow-up article covering the use of PMML for data crunching using Hadoop.

Samstag, 12. Oktober 2013

A bunch of questions about founding a start up company

The whole internet seems to be overflowing with advice for start ups. Experts and industry veterans share a wide range of information on twitter and blogs, at meet-ups and conferences. They tend to focus on marketing, product development, metrics and financing. Much of this information is contradictory (and much of the information regarding finding investors seems sketchily focused on “luring” them in, rather than rightfully gaining their trust.)
For all the abundance — and diversity — of advice for start ups, I have not yet been able to find anything on more practical, immediate problems, particularly those facing start ups in Berlin or Germany more broadly. Assuming that I am already running a landing page to catch some early adopters, and giving them access to a functional prototype/alpha under special conditions to test out and develop my product, I still have questions that I'd organize into two main groups: 1. Pre-incorporation logistics and 2. Legal and tax details of the founding process.

1. Pre-incorporation logistics:

  • At what point should I legally found my company? Only once I am certain that my product and business model will work out? Or should I incorporate before I handle any data from businesses, or from private customers?
  • If I were to collect data before legally founding my company, how can I legally transfer the prototype and the data to the company?
  • In Germany, having an imprint on your website and following data protection laws is very important. Failure to correctly comply can easily lead to notices from lawyers paid by competitors - without a limited liability company, how can I protect myself? 
  • How do I handle expenses I incur before officially founding my company? What should I do about them if my plans do not materialize before legally founding? Are there any tax breaks that I can claim from pre-founding investments or expenditures? 
  • Can I "hire" any help in advertising/marketing before founding a company?

Assuming I decide to start a German UG company (basically a limited company, with some additional drawbacks), this leads to another group of questions.

2. Legal and tax details:

  • How much capital beyond the first necessary Euro do I need in practice? Does that have any practical impact on how my company can interact with other companies?
  • What other costs can I avoid? I have to pre-pay taxes, but how should I go about making a good estimate of what to pay? Should I try to over- or underestimate? Can I avoid the IHK membership fee at the beginning?
  • How should I handle expenses I privately financed before founding, e.g. hosting a landing page, some marketing costs, money spent to research the market and buy publications?
  • How much money should I put into the company to keep it solvent, without locking up more liquidity than my personal bank account can handle? How about adding more money later on, especially when there are multiple founders? How do I balance stake versus money put in?
  • In case I suspect that my company will be bought at some point, should I set up a holding company to hold the actual company, e.g. to avoid the early payment of taxes?
  • If I plan to have investors, do I need to make any preparations to easily enable parts of the company to be owned by third parties?
  • Should I use the "Musterprotokoll" when founding my company, or should I have a custom one made? Should I use a service provider like go-ahead to found the company, or do everything on my own?
  • Should I found a German company, or an English, American, etc. one? Especially in software and SaaS, being based somewhere else may be easier tax- and data-protection-wise.

I am aware that a lot of these questions depend on what my specific enterprise will do, and that I will need to talk to a professional - but who should I talk to?

Dienstag, 27. August 2013

A quick overview of simple encryption for end-users

Recent revelations regarding PRISM and other US (as well as UK and EU) government agencies systematically collecting and processing private information have sparked many conversations about encryption across social networks. So, here is a quick overview of a few free, (mostly) open source programs that enable everyone to use high quality encryption. The encryption schemes used by the programs listed below are considered practically uncrackable: breaking them with currently available computing power would take billions of years.

To encrypt your hard drive and/or private data files, I highly recommend TrueCrypt. It can encrypt your entire hard drive, including your operating system. Or, it can create a virtual disk that, once mounted, looks and functions in your explorer just like a USB drive -- only, everything you put onto the disk stays on your hard drive, now stored in an encrypted file. Option #1, the encrypt-everything solution, requires you to provide a password when you boot up your computer. The TrueCrypt manual provides more information on how to set up encrypted devices or fine-tune some of the available parameters.
It is also possible to use TrueCrypt on Android phones with EDS, and if the phone is rooted, it is possible to share encrypted files via Dropbox between smartphones and other devices.

To encrypt emails, as well as single files, public key cryptography is the go-to choice. It works by having every user generate two keys: a public one that anyone can know, and a private one that must stay secret. The public key can be used to create encrypted messages for the owner of the key (think of this like putting a lock onto the message) -- they can only be decrypted using the private key (the key to the lock that you put on the message). The private key can be used to create signatures of documents. This enables third parties to reliably check whether a text was authored, and left unmodified, by the owner of the private key.
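To make the lock-and-key picture concrete, here is a toy RSA sketch with tiny primes. It is purely illustrative and offers no real security whatsoever; OpenPGP implementations use far larger keys plus padding and hashing:

```python
# Toy RSA with tiny primes -- for illustration only, never for real use.
p, q = 61, 53
n = p * q                # the modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                   # public exponent (coprime to phi)
d = pow(e, -1, phi)      # private exponent: modular inverse of e

def encrypt(m):          # anyone can encrypt with the public key (e, n)
    return pow(m, e, n)

def decrypt(c):          # only the private key (d, n) can decrypt
    return pow(c, d, n)

def sign(m):             # signing uses the private key ...
    return pow(m, d, n)

def verify(m, sig):      # ... and anyone can verify with the public key
    return pow(sig, e, n) == m
```

The asymmetry is the whole point: encrypt and verify only need the public pair (e, n), while decrypt and sign need the secret d.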

Popular software that integrates this form of encryption into your email program in the openPGP standard are

These programs come with an extensive, 190-page manual, although using them is really simple and easy thanks to the plugins for Thunderbird and Apple's Mail program.

To encrypt chat sessions, I highly recommend OTR. It is a plugin for Pidgin (Windows, Linux) and Adium (Mac), open source chat clients that support a wide range of chat protocols, e.g. Google Talk, MSN, Facebook, XMPP, ICQ, and many more. It is also available in Gibberbot for Android. Using the skype4pidgin plugin in Pidgin or Adium, it is possible to encrypt the text chat of Skype, too. Unfortunately, it does not help to secure voice communication. There are some guides, including video tutorials, for setting it up.

(Another interesting feature: the plugin not only offers encryption, but also enables plausible deniability. If used correctly, it is not possible to attribute authorship to the messages by just looking at the encrypted stream of exchanged messages.)

For push-notification-based instant messaging on smartphones, I recommend Threema. The system described above (OTR) requires users to be online to send messages - but WhatsApp and others do not require both parties to be online at the same time. Threema avoids this issue by using a scheme similar to the public key encryption in emails.

Unfortunately, it costs a couple of bucks (~1.80 €), but there are no further charges. It basically is an encrypted version of WhatsApp. The only drawback is, of course, no intercommunication with other messengers. It is also closed source, meaning it is hard to check what the app actually does, but the cryptography is provided by the open source library NaCl.

Surfing anonymously is a bit harder. There is a choice of VPN providers, but they are all centralized, i.e. there is one party that knows everything you do. There are a few peer-to-peer solutions; the best and probably most famous one is the Tor project.
The Tor project offers a lot and is actually used in China, Egypt and other Arab countries to enable and protect members of the opposition. There is an amazing talk, "How governments have tried to block Tor", by one of the creators, describing the arms race with China and other countries over enabling and disabling access to the Tor network - once a user is in, it is almost impossible to track their activities.