In linguistic applications, tasks typically are translating a sentence, or deciding whether a given string belongs to a specific language. In the past, popular models to learn rules were finite state machines, pushdown automata, and hidden Markov models. We understand these models fairly well, and they each describe a class in the Chomsky hierarchy. This makes them very apt to model formal systems. But when it comes to describing natural language and solving problems in NLP, the rules imposed by formal grammars are often too strict and limited to model human writing and speech.
In recent years, neural models outperform automata models, especially deep networks, when solving real-world tasks. Deep networks perform particularly well on large datasets. Interestingly, recent developments in parts of the deep learning community took renewed inspiration from the field of automata and formal models to improve RNN- and LSTM-based deep network for sequence prediction and transducing tasks. This isn’t the first time the two fields meet, see such as Giles et al.’s work from the early 90ies on neural stacks, but it is the first time deep networks are used in practice at large-scale, offering the best performance.
The key idea behind all proposals is to extend neural networks with memory managed by a controller. The controller, managing access and use to the memory, is built to be a differentiable operator (e.g. another kind of network with differentiable access operators). The resulting network can be trained using standard optimization algorithms and frameworks, benefiting from the same GPU acceleration other networks do, too.
To my limited knowledge, the increase in interest in these models came with Graves et al. at Deep Mind neural turing machines (NTM)., and Weston et al. at Facebook memory networks, roughly proposed at the same time. Both approaches extend neural networks with a read-write memory block. While the NTM paper focuses on program inference and solving algorithmic tasks, the memory network paper focuses on increasing performance on language problems. Since other blogs already offer nice high-level summaries on NTMs and memory networks, I will not go into more details. AMoreover, at this year’s nampi workshop at NIPS, Graves extended the idea of memory access by additionally learning how many computation steps are required to finish computation and output a decision.
The paper I am focusing on is Learning to Transduce with Unbounded Memory by Grefenstette et al. The paper’s goal is to provide a middle ground between the fully random access memory of NTM and the static memory of RNNs. The abstract says:
Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit phenomena similar to those found in real transduction problems such as machine translation. These experiments lead us to propose new memory-based recurrent networks that implement continuously differentiable analogues of traditional data structures such as Stacks, Queues, and DeQues. We show that these architectures exhibit superior generalisation performance to Deep RNNs and are often able to learn the underlying generating algorithms in our transduction experiments.The key data structure implemented is a “continuous” stack. Its read- and write-operations are not discrete, but on a continuum in (0,1), modeling the certainty of wanting to push or pop onto the stack. The data objects are vectors. The stack is modeled by two components: a value matrix V, and a strength vector s. The value matrix grows with each time step by appending a new row and models an append-only memory. The logical stack is extracted by using the strength vector s. A controller acts on the tuple of value matrix and strength vector (V, s). It takes in a pop signal u, a push signal d, a value v, and produces an (output) read vector r. The quantities u and d are used to update the strength vector s, whereas v is appended to the value matrix V, and the read vector r is a weighted sum of the rows of the value matrix V.
The following figure illustrates the initial push of v_1 onto the stack, a very “weak” push of v_2, and then a pop operation and another push operation of a value v_3 (you can find the exact equations and rules to modify s and read r are stated in the paper).
The next figure illustrates the setup: the memory at the center and the controller input values d for pushing, u for popping, and the value v. Moreover, the previous value matrix and previous strength vector are used. The outputs are the next value matrix and strength vector as well as the read vector. This construction a differentiable memory block containing a stack. But there are no free parameters to optimize its behavior. By viewing the previous value matrix, strength vector, and read vector as a state output of an RNN that receives an input vector i, the authors obtain a trainable system with free parameters.
But what advantage does such a system offer? To determine its effectiveness, the authors consider several simple tasks (copying a sequence, reversing a sequence, and inverting bigrams in a sequence) and tasks from linguistics (using inversion transduction grammars, a subclass of context-free grammars). The network enhanced with a stack is compared to a deep LSTM network. Overall, the stack enhanced network not only performs better but also converges faster.
Unfortunately, the authors don’t provide an analysis of the stack usage. I think it would be interesting to see how the LSTM controller learns how to use the stack and compare the results with traditional pushdown automata. In grammatical inference, the usual goal is to find the smallest possible automaton. How different is this goal from learning a stack-enhanced LSTM? Can we understand the model, and does it offer some insight? The ability to interpret automata (and their use as a specification language in formal systems) is a huge motivating factor for our own work (see e.g. our paper on interpreting automata for sequential data). What can we learn from others?
Disclaimer: The graphics are taken from the research paper; Chomsky hierarchy from wiki commons (by Dnu72)
This blog post was first published on automatonlearning.net