Machine learning research is advancing at such a breakneck pace that the only way to keep up is to read every single day. So a while back, 100 days ago to be exact, I decided to read one paper every single day - the first thing to tackle every morning. The notes are split into two posts (NLP and Images) to turn one gigantic blog post into two merely large ones. Please note, this is written by a student and may contain errors. Without further ado:

Noise Contrastive Estimation and Negative Sampling (Chris Dyer) [Read Sept XXth]

  • Idea: xyz
  • Lesson: foo

Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models (Luong, Manning) [Read Oct 1st]

  • Idea: Neural machine translation (NMT) has issues when it hits OOV words, and often resorts to predicting an <UNK> token. One solution is to track the position of each unknown word, and during decoding a dictionary lookup or identity copy can be performed to replace the <UNK> in the translation (a toy version of this post-processing trick is sketched at the end of this entry). Taking that one step further, the authors present a hybrid system that translates mostly at the word level, but then consults the character components when it comes across rare or unknown words.
  • One might argue that if the model is already using character-based tools to get around OOV words, why not just use that everywhere? The answer is simple: by keeping the word-based translation portion, the model trains faster. Also, the word-based portion is sometimes better at translating proper nouns, such as names and locations. For example, in a word-based method, “Martin Luther King” needs a look-back of 2 words to realize that “King” is a very likely third word. On the other hand, a character-based method would need a look-back of 13 time steps to realize “K-i-n-g” refers to a name and not a reigning monarch.
  • Results: On the WMT’15 English to Czech translation task, this hybrid (ensemble) approach achieves 20.7 BLEU, a boost of +1.9 points over last year’s winning entry. Measured by chrF3, the hybrid model also achieves the SOTA with 47.5 points.
  • Lesson: Since character sequences are so long, having attention is absolutely critical.
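  • A toy version of the <UNK> replacement trick mentioned above (not the authors’ exact pipeline): assume we already have, for each target position, the index of the source word it attends to most, plus a small bilingual dictionary. The function name and the example sentence are made up for illustration.

```python
# Toy post-processing pass: replace <UNK> outputs using the aligned source word,
# via a dictionary lookup if possible, otherwise an identity copy.
def replace_unk(target_tokens, source_tokens, alignments, dictionary):
    """alignments[i] = index of the source word that target position i attends to."""
    output = []
    for i, tok in enumerate(target_tokens):
        if tok != "<UNK>":
            output.append(tok)
            continue
        src_word = source_tokens[alignments[i]]
        # Use a dictionary translation if we have one, otherwise copy the source word.
        output.append(dictionary.get(src_word, src_word))
    return output

source = ["Martin", "Luther", "King", "besuchte", "Montgomery"]
target = ["<UNK>", "<UNK>", "<UNK>", "visited", "<UNK>"]
align = [0, 1, 2, 3, 4]
print(replace_unk(target, source, align, {"besuchte": "visited"}))
# ['Martin', 'Luther', 'King', 'visited', 'Montgomery']
```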

Neural Language Correction with Character-Based Attention (Xie, Avati, Arivazhagan, Jurafsky, Ng) [Read Sept 28th]

  • Idea: The authors apply an Encoder-Decoder model (bi-directional GRUs) with attention to the problem of Natural Language Correction (i.e. spellcheck) with inputs at the character level (which helps with OOV words).
  • Then a Kneser-Ney smoothed 5-gram language model is used, in conjunction with beam search (with depth k=6), to find the most probable words in the sentence. “Since decoding is done at the character level, the language model probability is only incorporated after a space or <EOS> symbol is encountered.”
  • From the user’s point of view, an incorrect spelling suggestion is a big deal, while occasionally missing a spelling mistake isn’t. This translates to a situation where precision matters more than recall, so the model is calibrated to favor precision by adding a binary classifier at the end that predicts whether the suggested edit is correct. If the classifier deems the edited sentence incorrect above some threshold, the edit is rejected. This classifier is trained on edit distance features and divergence-from-word-embedding features.
  • Results: There are noticeable gains over previous rule-based methods, going from an F-score of 35-37 to a new high of ~40.
  • Lessons: A common technique to increase data size is to augment with synthesized data. In this case, the data is created by injecting article or determiner errors (ArtOrDet) and noun number errors (Nn); a toy version of this noising is sketched at the end of this entry.
  • ArtOrDet errors are cases where an article or determiner (such as “a” or “the”) is incorrectly deleted, replaced, or added. An example is “She went to library to check out an book” where “the” is missing and “an” is misused.
  • Nn errors are cases where the noun is made singular or plural when it shouldn’t have been changed. An example is “Japanese food tastes great since I like to eat raw fishes.”
  • A pyramid scheme is used to speed up training, which is especially useful since characters stretch out the length of inputs by ~5x.
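  • Here is a toy sketch of the ArtOrDet-style noising mentioned in the Lessons above. It is not the paper’s augmentation procedure (which is informed by error statistics from learner corpora); it just randomly deletes or swaps articles to show the general idea.

```python
import random

ARTICLES = {"a", "an", "the"}

def corrupt_articles(tokens, p_drop=0.3, p_swap=0.3, seed=None):
    """Synthesize ArtOrDet-style errors by randomly deleting an article or
    replacing it with a different (likely wrong) one."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if tok.lower() in ARTICLES:
            r = rng.random()
            if r < p_drop:
                continue                               # delete the article
            if r < p_drop + p_swap:
                noisy.append(rng.choice(sorted(ARTICLES - {tok.lower()})))
                continue                               # replace with a wrong article
        noisy.append(tok)
    return noisy

clean = "She went to the library to check out a book".split()
print(" ".join(corrupt_articles(clean, seed=0)))
```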

Recurrent Highway Networks (Zilly, Srivastava, Koutník, Schmidhuber) [Read Sept 27th]

  • RNNs suffer from vanishing and exploding gradients. To get a sense of the theoretical limits, the Geršgorin Circle Theorem is used to bound where the eigenvalues of the recurrent weight matrix can lie, which suggests that training is best behaved when that matrix (and thus the spectrum of the gradients) stays close to the identity.
  • In order to deal with the varying spectrum of gradients, the authors create a new architecture called Recurrent Highway Networks (RHN) that allows more flexibility over the variance by adding gated highway layers inside the recurrence. In particular, they propose a model where the recurrence is applied more often than new inputs arrive (several highway micro-steps per time step), and they hope these extra recurrences will learn to control the spectrum of gradients (a toy numpy version of one time step is sketched at the end of this entry).
  • The intuition is that the highway connections allow gradients to flow back more smoothly, rather than exploding or vanishing immediately as in a normal RNN. Successful training is possible with this extra control, though there is no guarantee the model will actually learn to use it that way.
  • Results: Based on an update, the language model is down to ~64 perplexity on PTB, which is state of the art as of Sept ‘16.
  • Lesson: The authors spend a lot of time on the Geršgorin Circle Theorem, but really only use that to help initialize the weights in the model. They don’t really use the theorem to inform the structure of their new network. Ultimately, this just seems like a slightly more complex version of the highway network, which is not a surprise since this is the same group of people. There is probably going to be a simpler (and better performing) ResNet version of this in a few months.
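  • As mentioned above, here is a toy numpy sketch of one RHN time step as I understand it: the state is refined by several highway micro-steps between inputs, with the carry gate coupled as 1 - t (one of the variants discussed in the paper). Shapes and initialization are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rhn_step(x, s_prev, Wh, Wt, Rh, Rt, bh, bt):
    """One RHN time step: len(Rh) highway micro-steps refine the state s
    before the next input arrives; the input x only enters the first micro-step."""
    s = s_prev
    for l in range(len(Rh)):
        inp_h = x @ Wh if l == 0 else 0.0
        inp_t = x @ Wt if l == 0 else 0.0
        h = np.tanh(inp_h + s @ Rh[l] + bh[l])     # candidate update
        t = sigmoid(inp_t + s @ Rt[l] + bt[l])     # transform gate
        s = h * t + s * (1.0 - t)                  # carry gate coupled as (1 - t)
    return s

rng = np.random.default_rng(0)
d_in, d_hid, depth = 4, 8, 3
Wh, Wt = rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_in, d_hid))
Rh = [rng.normal(size=(d_hid, d_hid)) * 0.1 for _ in range(depth)]
Rt = [rng.normal(size=(d_hid, d_hid)) * 0.1 for _ in range(depth)]
bh, bt = [np.zeros(d_hid)] * depth, [np.zeros(d_hid)] * depth
print(rhn_step(rng.normal(size=d_in), np.zeros(d_hid), Wh, Wt, Rh, Rt, bh, bt).shape)  # (8,)
```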

Sentence Level Recurrent Topic Model: Letting Topics Speak for Themselves (Tian, Gao, He, Liu) [Read Oct 2nd]

  • Idea: Topic models identify patterns in a corpus in order to figure out its relevant topics, which is useful for downstream tasks such as translation or document classification. As an example, given a corpus with 100 sentences and a choice of k=7 topics, the topic model will label each of the sentences with a percentage score for each of the seven topics, adding up to 100% (e.g. 30% Topic 2, 60% Topic 5, 10% Topic 7, and 0% for all remaining topics).
  • Traditionally, algorithms in this realm (such as LDA) do not take into account the sequential order of words or their topic coherence (and instead consider sentences more as unordered Bag-of-Words). Taking advantage of recent advances with RNNs that handle sequential data quite well, the authors propose Sentence Level Recurrent Topic Model (SLRTM) that assumes the generation of each word within a sentence to depend on both the topic of the sentence and the whole history of its preceding words in the sentence. Additionally, SLRTM enforces topic coherence from one word to another by forcing all the words in each sentence to take on a single topic label (i.e. 100% Topic 5 and 0% everything else).
  • Results: The new model is benchmarked on two types of tests: (a) generative model evaluation as measured by test set perplexity and (b) document classification as measured by accuracy. The final results seem underwhelming. Whereas LDA and other methods achieve perplexity of ~950 to 1000, SLRTM sits around 400. However, SLRTM performs worse than a standard LSTM language model “with a small gap of about 100”. If the model is worse, why is it useful?
  • Presumably, you could use a SLRTM (or LDA) as a form of pre-training, such that the topic vector it produces is fed into a Seq2Seq model in addition to the original sentence, in order to build a stronger language model. However, I am skeptical all the extra training time required would be worth the effort.
  • Additionally, the SLRTM weight vectors are used as inputs into a logistic regression model for document classification. Again, the results look decent, but why not just train a single end-to-end classifier?

Semi-supervised Sequence Learning (Dai, Le) [Read: Oct 5th]

  • Idea: Pre-training makes supervised tasks work better. Here, the authors propose two methods that support semi-supervised learning. The first method is to use a sequence autoencoder (SA) whereby a series of tokens is passed into an encoder-decoder network to predict the same series of tokens. This is basically the same as a typical autoencoder except that instead of a fully-connected neural network (where the weights are encoded in the hidden layers), the new method uses LSTMs.
  • A second method is pre-training using a recurrent language model (LM). Whereas the SA encodes the source document and then reconstructs it, the LM drops the encoder portion and simply pre-trains the recurrent network to predict each next word given the words so far. The authors don’t spend a lot of time describing this in detail, and I guess it doesn’t really matter in the end because the performance of the LM-LSTM is worse than the SA-LSTM.
  • Furthermore, the authors show that transfer learning is possible, such that pre-trained vectors for one dataset can actually be useful for classification on a different dataset (if the corpora exhibit a similar breakdown of words).
  • Results: Testing on IMDB and Rotten Tomatoes sentiment analysis, the SA-LSTM beats all other methods. LM-LSTMs are also competitive, but not as good. Testing on NewsGroup 20 and DBPedia, the SA-LSTM also does a good job, beating others by 5-10% error.
  • Lessons: Skip-thoughts try to create a vector embedding by directly trying to estimate the previous and following sentences. While this has yielded some success, building upon more fine grained information (words or morphemes) should yield better results, as seen in this paper.
  • Since the datasets were relatively small (10k to 20k documents) and the networks were large (1024 hidden units), severe embedding dropout was used (70% to 90% range)!
  • Try this at home - these methods initialize the weights of the network; they do not play a role in altering the inputs to the network. Thus, for more non-random initialization fun, use word2vec or GloVe embeddings rather than one-hot encoded word vectors.
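  • To make the last point concrete, here is a minimal helper (my own toy code, not from the paper) for initializing an embedding matrix from pretrained vectors such as word2vec or GloVe, with random fallback for unseen words.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Rows covered by the pretrained vectors are copied in; the rest stay random."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(len(vocab), dim))
    hits = 0
    for i, word in enumerate(vocab):
        if word in pretrained:
            emb[i] = pretrained[word]
            hits += 1
    print(f"initialized {hits}/{len(vocab)} rows from pretrained vectors")
    return emb

vocab = ["the", "movie", "was", "great", "zzz_unseen"]
pretrained = {w: np.full(5, i, dtype=float) for i, w in enumerate(["the", "movie", "was", "great"])}
emb_matrix = build_embedding_matrix(vocab, pretrained, dim=5)
```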

Multilingual POS Tagging with Bidirectional LSTMs and Auxiliary Loss (Plank, Søgaard, Goldberg) [Read: Oct 9th]

  • Idea: Bidirectional LSTMs have shown great promise for a variety of NLP tasks. However, there hasn’t been a good survey directly comparing all the different configurations. In this paper, the authors build a POS tagger across different input representations (word, character, unicode bytes), different languages (English + 21 others), different dataset sizes (100 vs 500 vs 1000 sentences), and different levels of label noise (0% - 40% corruption).
  • Separately, they also introduce a new auxiliary loss function which they believe is better at handling rare words. The model jointly predicts the POS tag and the log frequency of the word. The intuition is that word frequency directly distinguishes rare from common words, so adding this signal discourages shared representations of rare and common words (a toy frequency-binning helper is sketched at the end of this entry).
  • Results: The word frequency model (FreqBin) does barely better than the standalone model, such that it is probably not worth the extra hassle to incorporate, unless you expect there to be lots of OOV words in your data.
  • Lesson: A Word+Char hierarchical model with pre-trained word embeddings works best, and this was roughly true regardless of which language was being tagged. In regards to data size, bi-LSTMs start out a little worse than Markovian models, but soon take over in performance with more training examples (as few as ~500 sentences). In regards to noise, the TNT Markovian model performs better when the labels are severely corrupted (30%+ corruption), but generally speaking both types of models degrade at about the same rate.
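  • A toy version of the frequency-bin auxiliary target mentioned above. The exact binning in the paper may differ (I am using the integer log2 of the training-set count); the point is just that rare and common words get different auxiliary labels.

```python
import math
from collections import Counter

def freqbin_labels(corpus):
    """Attach a discretized log-frequency label to every token; this is the
    auxiliary target predicted alongside the POS tag."""
    counts = Counter(w for sent in corpus for w in sent)
    def bin_of(word):
        return int(math.log2(counts[word])) if counts[word] > 1 else 0
    return [[(w, bin_of(w)) for w in sent] for sent in corpus]

corpus = [["the", "dog", "barks"], ["the", "the", "cat", "sleeps"]]
print(freqbin_labels(corpus))
# [[('the', 1), ('dog', 0), ('barks', 0)], [('the', 1), ('the', 1), ('cat', 0), ('sleeps', 0)]]
```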

Memory Networks (Attention, LSTMs, NTMs)

Improvements in performance by augmenting memory using any variety of methods, including implicit or explicit storage.

Neural Turing Machines (Graves, Wayne, Danihelka) [Read Oct 21st]

  • Idea: Memory Networks are cool because they allow for relatively unlimited storage capacity by a neural network before it needs to produce an output. But you know what’s even cooler than selectively reading from an external memory store? Being able to selectively read from and write to an external memory store, which is precisely what the NTM is able to accomplish. In doing so, the NTM pushes deep learning models past finite state machines and pushdown automata, and into the realm of Turing machines, hence the name Neural Turing Machine.
  • The NTM has a few general components:
  • External Memory: This is the storage of all the inputs represented as vectors stored in “locations”. In practice the number of locations is fixed, but in theory it should not be. Since the number of locations (N) is fixed, the rotational shift for location-based addressing is calculated modulo N, so that any shift past the end goes back towards the beginning of the memory tape.
  • Controller: operates the read head and the write head. Reading is done using a soft attention mechanism which can choose to focus sharply on a few memories, or more weakly on a larger number of memories, depending on the task at hand. Deciding when to write or erase a memory is controlled by their own individual weights, similar to the input and forget gates in LSTMs.
  • More precisely, in each update step, the controller performs a series of operations:
    1. Addressing - In order to find out where a piece of content is stored, the model can perform content-based addressing or location-based addressing. Since certain problems are more suitable for one type than the other, the authors actually use both methods.
    2. Interpolation - This represents how the content addressing method looks for the next memory. In typical content-addressable systems, addresses are formed using hashing algorithms (e.g. SHA-1), but that would be overly complicated here, so instead the content comparison is done using cosine similarity (minimizing cosine distance). Before performing the calculation, an interpolation weight in (0,1) balances how much the model should care about the content weighting from the current time step compared to the weighting from the previous time step.
    3. Convolutional shift - This represents how the location addressing method looks for the next memory for reading or writing. Assume the read head is at position X, and the next relevant memory is located at position Y. Then a properly trained model should rotate the read head by precisely that amount (Y-X). This operation is governed by the “shift kernel”, whose outcome tells the read head to move a certain number of steps to the right or left in the memory bank.
    4. Sharpening - since the weightings often come out as percentages spread out across many memory locations, a sharpening step forces the focus into fewer places. For example, if there are five memories to choose from with weights of (0.1, 0.9, 0.2, 0.1, 0.8), the model might sharpen into (0, 1, 0, 0, 1) to pick from only two memories instead. (The full addressing pipeline is sketched in numpy at the end of this entry.)
  • By using these multiple-addressing mechanisms, the model is able to draw from the right memory in a number of ways:
    1. Content key only: memory is accessed like an associative map.
    2. Content and location: the content key finds a contiguous block of memory, and then shifts near that area to find the relevant memory. Probably most useful for associative recall.
    3. Location key only: the shifting iterates a certain amount from the last focus.
  • Results: This type of architecture is unbelievably astounding (especially when it came out) because it means you can have a fully trainable computer! However, just like the earliest computers are not nearly as powerful as even the cheapest PC available on the market today, this first iteration of the NTM was used mostly for basic algorithmic tasks:
    • Copying: the input sequence is random binary vectors and the goal is to repeat the pattern for a long time. For example, given a 20-step pattern, repeat that same output for 50 or 100 time-steps. The LSTM could do the 20 steps pretty well, but degrades shortly thereafter. The NTM could perform decently well up to 100 steps since it has the ability to effectively store the entire binary vector and retrieve it when necessary.
    • Associative recall: the training sequence might be a series of vectors and the query would be one item in that pattern. Then the desired output is the next item in the series that comes right after the input. For example, given a training pattern of (3,7,5,3,7,5,3) and a query of 5, the output should be 3. Given a query of 7, the output should be 5. NTM is able to accomplish this as well, although not perfectly.
    • Sorting: the training sequence is 20 binary vectors with associated priorities, and the target output is those vectors returned in sorted priority from highest to lowest. The NTM is able to perform this quite well, whereas the LSTM fails.
    • (Repeat Copy and N-grams are skipped)
  • Lessons: In all cases, the NTM controller using the LSTM performs better over time than the controller using a feedforward structure. Overall, while the tasks are simple, the fact that the system can do anything at all is absolutely tremendous!
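  • Here is the addressing pipeline from steps 1-4 above as a toy numpy function, per my reading of the paper; in the real model the key, gates, shift kernel and sharpening exponent all come from the controller rather than being hand-set as here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ntm_addressing(M, w_prev, key, beta, g, shift, gamma):
    """NTM-style addressing over memory M (N locations x D dims): content focus,
    interpolation with the previous weighting, circular shift, then sharpening."""
    # 1) Content addressing: cosine similarity against every memory row.
    sims = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w_c = softmax(beta * sims)
    # 2) Interpolation between the content weighting and the previous weighting.
    w_g = g * w_c + (1 - g) * w_prev
    # 3) Circular convolution with a small shift kernel, modulo N.
    N = len(w_g)
    w_s = np.zeros(N)
    offsets = range(-(len(shift) // 2), len(shift) // 2 + 1)
    for i in range(N):
        for s, off in zip(shift, offsets):
            w_s[i] += w_g[(i - off) % N] * s
    # 4) Sharpening: raise to gamma >= 1 and renormalize.
    w = w_s ** gamma
    return w / w.sum()

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 4))
w = ntm_addressing(M, np.ones(5) / 5, M[2], beta=5.0, g=0.9,
                   shift=np.array([0.1, 0.8, 0.1]), gamma=2.0)
print(np.round(w, 3))   # should concentrate near location 2
```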

Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes (Gulcehre, Chandar, Cho, Bengio) [Read Oct 22nd]

  • Idea: Neural Turing Machines (NTMs) are indubitably intriguing due to their unbounded expressive power, but in their first iteration they were only capable of targeted tasks such as copying and associative recall. In this paper, the authors extend the limits of NTMs by introducing a learnable addressing mechanism and a specialized hard attention mechanism, allowing the NTM to challenge (but not surpass) state-of-the-art results on the Facebook bAbI Q&A tasks. The authors also introduce a curriculum strategy for the attention mechanism that helped a lot.
  • At each time step, the controller will:
    1. receive an input value (x)
    2. address and read some memory (m)
    3. erase/write a portion of the memory
    4. update its own hidden state (h)
    5. output a value (y) if needed
  • Differing from the original NTM, each memory location now has two components:
    1. a learnable address matrix (a) that can be updated throughout training
    2. a static content portion (c) that stays constant for each episode of training. It is zeroed out at the beginning of each episode.
  • This sounds a lot like the Key-Value Memory Network in the sense that the key is the address and the value is the content of the memory.
  • The controller now needs to produce updated weights for reading (w), erasing (e), writing (u), and content (c). The model decides what to read based on m_t = w_t M_{t-1}, where M is the matrix of all memories. The model decides what to write based on m_t = (1 - e*u) * m_{t-1} + u*c. Basically, writing trades off choosing to erase/overwrite the old memory against choosing what new content to write into it.
  • The paper introduces a memory addressing schema that can learn to put more emphasis on the least recently used (LRU) memory locations. Somehow, this makes it easier to learn pure content-based addressing?
  • Results: Using a 3-step discrete, GRU-based D-NTM, the model averages a 21.8% test error, which is significantly better than the LSTM or original NTM models that hover around 30% test error. The best standalone feed-forward D-NTM achieves 37.8% error. This is impressive because the Facebook bAbI tasks are a huge leap over associative recall tasks. However, compared to the MemN2N (4.2%) and DMN+ (2.8%) models, the D-NTM still has a lot of catching up to do.
  • In the process of designing the D-NTM, the authors tried multiple variations that allowed them to push performance. In general, a feed-forward controller (even when trained for up to 4x longer) performs worse than a GRU-based controller. The learnable addressing scheme helps, but probably not as much as one would hope. Discrete attention, which uses sampling through the REINFORCE algorithm, works slightly better than the soft attention mechanism (at the cost of end-to-end differentiability).
  • Although the results above would not suggest it, the best outcome was a soft/hard attention feed-forward controller set up through curriculum learning. Concretely, the model learns to use memory at the beginning mainly with soft attention, as governed by a parameter (π). As the probability of using soft attention anneals, the model transitions to a discrete attention mechanism over time. This achieves an eye-popping 12.8% test error on the bAbI tasks, which is almost twice as good as any other D-NTM!
  • Lesson: Sometimes, the right thing for the model to do at a certain time step is nothing. Thus, the authors have a NOP (no operation) cell where nothing happens if something is written to it.
  • The original NTM used multiple read-write heads to perform multiple lookups of the data. This time around, the multi-hop mechanism from DMNs was used to allow a single read-write head to perform multiple lookups.

  • From the paper: “Memory networks [2] explicitly store all the facts, or information, available for each episode in an external memory (as continuous vectors) and use the attentional mechanism to index them when returning an output. On the other hand, neural Turing machines (NTM, [7]) read each fact in an episode and decide whether to read, write the fact or do both to the external, differentiable memory. A crucial difference between these two models is that the memory network does not have a mechanism to modify the content of the external memory, while the NTM does. In practice, this leads to easier learning in the memory network, which in turn resulted in it being used more in real tasks [8, 9]. […]”
  • From the paper: “The original NTM supports two modes of addressing (which can be used simultaneously). They are content-based and location-based addressing. We notice that the location-based strategy is based on linear addressing. The distance between each pair of consecutive memory cells is fixed to a constant. We address this limitation, in this paper, by introducing a learnable address vector for each memory cell of the NTM with a least recently used memory addressing mechanism, and we call this variant a dynamic neural Turing machine (D-NTM). We evaluate the proposed D-NTM on the full set of Facebook bAbI tasks [2] using either soft, differentiable attention or hard, non-differentiable attention [10] as an addressing strategy. […]”

Associative Long Short-Term Memory (Danihelka, Wayne, Uria, Kalchbrenner, Graves) [Read: Sept 18th]

  • Idea: An alternative to attention mechanism for augmenting RNNs with extra memory, without adding extra parameters. Use a concept from Holographic Reduced Representations (HRRs) (Plate, 2003) to enable key-value storage of data. This is then slightly modified to use redundant storage to increase memory capacity and reduce noise in memory retrieval.
  • More specifically, the KV pairs are stored in an associative array, where each individual pair is the same size as the entire associative array, and the array is the sum of the bound pairs. Suppose we are trying to store three values: the keys are 3 complex numbers (k = real + imaginary), the values are three numbers (v), and the associative array is a = v1*k1 + v2*k2 + v3*k3 (a toy numpy version of this storage scheme is sketched at the end of this entry).
  • Since all the numbers are mixed together, the way to extract a certain value is to use the inverse of the key, or the complex conjugate. For example, to extract the value of v2, we use conjugate of k2. When done this way, the other remaining terms simply become “noise” that should end up with a mean of zero.
  • Further, the noise can be reduced by forming multiple associative arrays with different KV pairs (let’s say 20 of them). Then, when trying to extract v2, we pull the v2s from the 20 associative arrays and average them together.
  • Results: On tasks of (a) copying random digits (b) noticing closing XML tags (c) variable assignment and (d) simple arithmetic and (e) character prediction on Wikipedia, associative LSTMs performed at or near the best of all options.
  • Context: Authors followed up by comparing performance to a Neural Turing Machine. In these examples, an associative LSTM had a more stable decrease in cost with each minibatch, but both architectures generally converged around the same final cost.
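  • A toy numpy sketch of the complex-valued key-value storage described above. One simplification flagged up front: I draw independent random keys for each redundant copy, whereas the paper derives the copies via permutations of the keys; the noise-averaging effect is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items, n_copies = 64, 3, 20

# Three values to store, and unit-modulus complex keys (one key per value).
values = rng.normal(size=(n_items, dim))
def random_keys():
    return np.exp(1j * rng.uniform(0, 2 * np.pi, size=(n_items, dim)))

# Redundant storage: several associative arrays, each a = sum_i v_i * k_i.
keys = [random_keys() for _ in range(n_copies)]
arrays = [np.sum(values * k, axis=0) for k in keys]

# Retrieve value #1 from each copy with the conjugate key, then average:
# the other terms become zero-mean noise that averaging suppresses.
single = (arrays[0] * np.conj(keys[0][1])).real
averaged = np.mean([a * np.conj(k[1]) for a, k in zip(arrays, keys)], axis=0).real
rel = lambda est: np.linalg.norm(est - values[1]) / np.linalg.norm(values[1])
print(f"relative error with 1 copy: {rel(single):.2f}, with {n_copies} copies: {rel(averaged):.2f}")
```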

Recurrent Dropout without Memory Loss (Semeniuta, Severyn, Barth) [Read: Sept 22nd]

  • Idea: Dropout from one layer to another in a feed-forward network is great for regularization, so applying the same idea to the time steps of an RNN should work as well. However, the straightforward method of randomly dropping some of the hidden state information passed from one time step to the next defeats the purpose of the forget gate, whose job is precisely to regulate this information flow, and causes LSTMs to perform less than optimally.
  • Thus, to circumvent this problem, the authors propose a “recurrent dropout” method that critically does not apply dropout to the hidden state (Gal 2015) or to the entire memory state (Moon 2015), but rather only to the cell update vector during the calculation of the next internal memory state. This works because only the candidate updates (the values added through the input gate) are dropped, so the previous memory states are always allowed to flow through unimpeded (the forget and output gates are untouched). Note that this implies dropout masks are applied on a per-step basis (one word at a time), rather than a per-sequence basis (the whole sentence). Also, note that this level of dropout specificity is applicable only to gated architectures such as LSTMs and GRUs, and not to vanilla RNNs (a toy LSTM step with this dropout is sketched at the end of this entry).
  • Results: With no dropout perplexity is around 130, with naive dropout perplexity is around 114, with recurrent dropout perplexity goes to ~103, and finally combining recurrent dropout with forward dropout leads to 91.6! Similarly awesome results are seen for character level modeling, NER, Twitter sentiment analysis. Notably, Seq2Seq tasks are not covered.
  • Lessons: “Models trained using the recurrent dropout scheme have slower convergence than models without dropout and usually have larger training error and lower validation errors. This behavior is consistent with what is expected from a regularizer and is similar to the effect of the feed-forward dropout applied to non-recurrent networks.”
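  • A toy numpy LSTM step showing where this dropout mask goes, per my reading of the scheme: only the candidate update g is masked, per step, so the memory c itself is never zeroed out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_recurrent_dropout(x, h, c, W, U, b, p_drop, rng, train=True):
    """One LSTM step where dropout hits only the candidate update g, so the
    existing memory c still flows through the forget gate untouched; masks are
    drawn fresh at every step (per-step, not per-sequence)."""
    z = x @ W + h @ U + b                       # all four gates computed at once
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    if train:
        mask = (rng.random(g.shape) > p_drop) / (1.0 - p_drop)   # inverted dropout
        g = g * mask
    c_new = f * c + i * g                       # memory itself is never dropped
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_hid = 4, 6
W = rng.normal(size=(d_in, 4 * d_hid))
U = rng.normal(size=(d_hid, 4 * d_hid))
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step_recurrent_dropout(rng.normal(size=d_in), h, c, W, U, b, 0.3, rng)
print(h.shape, c.shape)   # (6,) (6,)
```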

Character-based Neural Machine Translation (Ling, Trancoso, Dyer, Black) [Read Sept 25th]

  • Idea: When performing NMT, word-level information often adds a critical source of bias because words are not necessarily the smallest unit of meaning. Thus, the authors introduce a translation model that views the input and output sentences as sequences of characters rather than words. Since any given language is made of only a limited number of characters (as opposed to words), the model provides several advantages:
    • Capable of interpreting and generating unseen word forms
    • Alleviates much of the challenges associated with preprocessing and tokenizing the source and target languages
    • Computational speed and memory requirements are considerably lowered since (a) there aren’t as many options to search through and (b) there isn’t a need to store so many options in memory
  • These sentences are now translated in a hierarchical fashion such that characters are built up to create word representations which are then built up to create sentences. The end-to-end process can be viewed as having three major components:
    • 1) A Character-to-Word model (C2W) encodes the input into words, which then are entered into the RNN at each timestep.
    • 2) Attention mechanism decides which words are most helpful in determining the next word in the translation.
    • 3) A Vector-to-Character model (V2C) selects the next character to output given the previous character and the probability distribution from the attention mechanism, up until an End-of-Word (EOW) token is produced during beam search. As the model decodes each word, another beam search is being run to find plausible sentences.
  • Results: Character-based model achieves BLEU results that are slightly worse than, but overall on par with conventional word-based models.
  • The C2W model is more likely than the word lookup table to select words that are spelled similarly.
  • Lessons: Due to the difficulty of directly training the C2W model, a compositional training scheme is introduced. To start, a word lookup table is used to maximize the translation score on the validation set. Then, the C2W model is trained to mimic the word lookup table by minimizing the squared distance between their embeddings. Finally, the C2W model replaces the lookup table and the full set of parameters (C2W, attention model and V2C) is fine-tuned again. The lesson here is that tough problems can be broken down into smaller, more manageable pieces.
  • IBM Alignment Models can be used to provide some weak supervision during training time to improve performance.

Listen, Attend and Spell (Chan, Jaitly, Le, Vinyals) [Read Sept 30th]

  • Idea: Hybrid DNN-HMM systems have shown significant gains in speech recognition over previous techniques, but the authors felt this could be improved by creating a model that utilizes only neural network components, and can thus be trained end-to-end. Listen, Attend and Spell (LAS) is a neural network that learns to transcribe speech utterances to characters (listen), and using an attention-based RNN (attend) will decode those inputs into characters, which subsequently is turned into words using beam search (spell).
  • Of particular interest is the use of the new pyramid bidirectional long short-term memory (pBLSTM) architecture during the “Listen” portion, since the “Spell” portion of LAS is a standard decoder RNN with attention. The pBLSTM cuts down on the effort needed to decode the hidden vector representation by reducing the number of time steps that the attention model has to extract relevant information from. Whereas the original acoustic signal inputs might be T timesteps long, the encoder cuts this down to U timesteps, where U < T. More concretely, each pyramid layer halves the number of timesteps, so after L layers the timestep count is U = T/2^L (a toy numpy version of this reduction is sketched at the end of this entry).
  • The pBLSTM achieves this speed up by simply concatenating the outputs at consecutive steps of each pyramid layer before feeding it to the next pyramid layer. Joining the previous input vectors might seem like a poor idea because it reduces the resolution of the data, but (possibly) like the max-pooling layers in ConvNets, key features are still carried forward and gradients still flow freely backward, so it doesn’t end up causing any major issues.
  • Another counterintuitive outcome: it might seem that having fewer options to choose from (U) would yield inferior results, but the authors found exactly the opposite to be true. They hypothesize that when there are too many inputs to choose from (T), the attention mechanism doesn’t know where to focus and thus has a hard time extracting the relevant information, much like a shopper might be lost trying to choose among 30 different brands of toilet paper rather than just picking from the top 3 options.
  • As one clear benefit, the deep encoding architecture allows the model to learn nonlinear feature representations of the data.
  • Results: On some Google voice search tasks, LAS+LM achieves a word error rate (WER) of 10.3% with top 32 beams. In comparison, state-of-the-art CLDNN-HMM model achieves a WER of 8.0%.
  • Lessons: Attention mechanism is good; Character-based encoding rather than word-based encoding is good (especially when words sound the same but are spelled totally different such as “triple a” vs. “AAA”); pyramid scheme is good; sampling schemes to lower computational complexity are good (scheduled sampling was used in this case).
  • (Opinion) The way the two vectors are joined might be something useful to explore. Rather than concatenation, perhaps a researcher could try averaging the two vectors or taking the max value at each dimension, so the final vector is the same size. Also, I wonder how dropout could possibly be applied to a pyramid scheme.
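  • A toy version of the pyramidal reduction described above: each level concatenates consecutive pairs of outputs, halving the number of time steps (the real pBLSTM then runs another BLSTM over each shorter sequence, which is omitted here).

```python
import numpy as np

def pyramid_reduce(h, levels=1):
    """Halve the number of time steps per level by concatenating each pair of
    consecutive outputs."""
    for _ in range(levels):
        T, d = h.shape
        if T % 2:                        # pad one extra frame if T is odd
            h = np.vstack([h, np.zeros((1, d))])
            T += 1
        h = h.reshape(T // 2, 2 * d)     # (T, d) -> (T/2, 2d)
    return h

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(80, 40))     # 80 frames of 40-dim features
print(pyramid_reduce(acoustic, levels=3).shape)   # (10, 320): U = T / 2**L
```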

Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction (Schmaltz, Kim, Rush, Shieber) [Read Oct 6th]

  • Idea: Apply a Seq2Seq model with attention to sentence-level grammar correction. They didn’t really come up with a new architecture or idea, but rather won a competition on a new dataset, which I guess warrants a paper. One interesting thing they did do is add a CharCNN component before passing their word vectors into the encoder. Whereas an image has columns and rows corresponding to its height and width with pixels as values, here each word is a matrix whose columns are its characters and whose rows are the n-dimensional character embeddings. So for example, if the word is “water” and we chose each letter to be a 30-dimensional vector, then the input to the CNN would have a size of 5x30. Then some convolutions with 3x2 filters are performed, the result is passed to a highway network, and finally fed as input into an LSTM (a toy max-over-time CharCNN is sketched at the end of this entry).
  • Results: AESW 2016 is a binary classification task asking “Did this scientific article contain any grammatical errors?” This team won first place, with an F-score of ~63% vs. 45% for random guessing and ~61% for the second place team.
  • Lessons: once again, for competitions, ensembles work best. This time, it was three character-based RNNs, one word-based RNN, and a sentence-level CNN that did the trick.
  • CharRNN performed better than the word-based model, likely because it was able to pick up on orthographic errors. This might seem like overkill, since knowing the error is not needed for a binary classification task, but the authors found initial evidence that training at the lowest granularity was more useful than simply training against the binary label.
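  • A toy max-over-time character CNN in the spirit of the CharCNN component above. The filter shape here (width 3 spanning the full 30-dim character embedding) is my own assumption for illustration, not necessarily the paper’s configuration.

```python
import numpy as np

def char_cnn_word_vector(word, char_emb, filters):
    """Embed each character, slide each filter over the character sequence,
    then max-pool over time to get a fixed-size word vector."""
    X = np.stack([char_emb[c] for c in word])            # (len(word), emb_dim)
    outputs = []
    for f in filters:                                     # each f: (width, emb_dim)
        width = f.shape[0]
        convs = [np.sum(X[i:i + width] * f) for i in range(len(word) - width + 1)]
        outputs.append(max(convs))                        # max-over-time pooling
    return np.array(outputs)                              # (n_filters,)

rng = np.random.default_rng(0)
emb_dim, n_filters, width = 30, 8, 3
char_emb = {c: rng.normal(size=emb_dim) for c in "abcdefghijklmnopqrstuvwxyz"}
filters = rng.normal(size=(n_filters, width, emb_dim))
print(char_cnn_word_vector("water", char_emb, filters).shape)   # (8,)
```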

Dynamic Memory Networks for Visual and Textual Question Answering (Xiong, Merity, Socher) [Read Oct 16th]

  • Idea: DMN+ improved its power by incorporating a number of ingenious enhancements from MemN2N while still preserving the best parts of the original DMN.
  • Results:
  • Lesson:

Dynamic Memory Networks for Natural Language Processing (Kumar, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus, Socher) [Read Oct 8th]

  • Idea: Most tasks in NLP can be cast into Question Answering (QA) problems over language input if you squint hard enough. For example, a sentiment analysis task can be rewritten as the question “What is the sentiment of this passage?” Consequently, the authors believe that if a network is built that can handle generic questions and produce coherent answers, then most NLP problems can be resolved. As a start of that endeavor they propose a Dynamic Memory Network (DMN) that takes in a question along with arbitrary input sequences to form episodic memories, and generates a relevant answer. More specifically:
    • 1) Compute a memory representation for all input sequences
    • 2) Search the memories for relevant facts through an iterative attention process using the question representation as the initial trigger
    • 3) Retrieve those facts and reasons over them to generate a vector representation of all relevant information
    • 4) Process the question in conjunction with the relevant facts vector using a hierarchical RNN to provide the final answer
  • As an advancement over memory networks of the time (MemNN), the DMN relies exclusively on trained word vector representations and input-question-answer triplets, and notably does not include any supervision of which input embeddings the model should pay attention to for generating a response. Even compared to the end-to-end model from FB, the DMN offers a couple of differences:
    • First a quick note on terminology, MemN2N refers to the stored objects as “memories”, while DMN refers to them as “facts”. MemN2N refers to each peek at the memories as “hops”, while DMN uses the term “episodes”.
    • Input Module: Whereas MemN2N summed up the word vectors to create a sentence embedding, DMN uses a GRU sub-network to encode the sentences. For cases when the input is a single sentence, each word in that sentence forms one fact (c). When the input is multiple sentences, each sentence in that document forms one fact. In the MemN2N, one memory (m) always refers to one sentence.
    • Question Module: a GRU sub-network sharing the same word embedding matrix is used to process the question. Unlike the input module, the question module only has one final output, which is the final hidden state of the encoder. MemN2N calls the output (u), DMN calls it (q). Unanswered - do the two modules use the same RNN weights?
    • Episodic Memory: Whereas MemN2N had multiple stacks of memories, one for each hop it plans to perform, DMN has one memory module that iteratively updates itself for each hop. In its general form, the episodic memory module is comprised of an attention mechanism as well as a recurrent network with which it updates its memory. The DMN creates an episode as a function of facts and questions. [e = f(c, q)]
    • Next, multiple hops over the memory are performed to generate the final output vector. During each hop, the DMN creates an updated memory as a function of its previous memory and the current episode output [m_t = f(m_{t-1}, e_t)].
    • Answer Module: Finally, the answer module takes as input (1) the previous hidden state (2) the question embedding and (3) the previous output word. This is then all processed through a GRU, multiplied by a weight matrix and finally passed through a softmax. More details on implementation in a future blog post.
  • Results: The authors test DMN on a variety of tasks:
    • Question/Answering: Gets most bAbI tasks correct (93.6% mean accuracy), and is comparable to MemNN. DMN does better on some of the 20 tasks and worse on others, which is to say it doesn’t blow the competition out of the water, but there is clearly a precedent being set.
    • Part-of-Speech Tagging: DMN is roughly the same as the best models for POS tagging on WSJ-PTB dataset, with 97.5% test set accuracy. Putting together an ensemble pushes them over the top with 97.56% accuracy.
    • Sentiment Analysis: The Stanford Sentiment Treebank (SST) dataset can be seen as two tasks (1) fine-grained root prediction with 5 classes or (2) binary root classification with 2 classes. DMN achieves 52.1% and 88.6% test accuracy for the two tests respectively, which is SOTA, with most other baselines around ~48% and ~85%.
  • Lesson: More hops make the model perform better, especially on BabI tasks set up specifically for multiple passes, such as those requiring transitive reasoning. Increasing the number of passes also slightly improves the performance on sentiment analysis, though the difference is not as significant.

End-To-End Memory Networks (Sukhbaatar, Szlam, Weston, Fergus) [Read Oct 11th]

  • Idea: Facebook researchers design an RNN network with access to an arbitrarily large external memory (in practice, a max of 320 memories). Multiple computational hops are used with an attention-like mechanism to yield the right information at the right time. Also, unlike the original version, this model is trained end-to-end, which effectively means that rather than being told explicitly which memories are relevant, a softmax is used to distinguish between useful sentences and distractions.
  • The model writes all inputs into memory up to a fixed buffer size using an embedding vector A. Writing the question into the model uses embedding vector B. Writing the output uses embedding vector C. Determining the final answer uses matrix W. Vocabulary is of size V = 177, which is ridiculously small compared to typical language modeling tasks.
  • In order to generate the embeddings for the different sentences, the values of all the words in the sentence are summed together, rather than concatenated, which helps preserve dimensionality, even when the sentences have different lengths (6 words vs. 9 words). The position of each word within the sentence is also added as a feature through a process called “positional encoding”.
  • Moreover, the authors also employ a technique they call linear start training, whereby the softmax in each memory layer is removed, making the model entirely linear except for the final softmax for answer prediction. Then, when the validation loss stops decreasing, the softmax layers are re-inserted and training recommences (sounds like pre-training with auto-encoders).
  • The authors noticed a lot of variance in results, so they repeated each training 10 times with different random initializations, and picked the run with the lowest training error (sounds like it is trying to solve the same problem as batch normalization). There are about ~7 other minor hacks they put in to make the whole system work better.
  • The input module is calculated as the softmax of the inner product of the question embedding (u) with the memory embeddings (m). In a hand-wavy way, it asks the question “how relevant is this memory for answering my question?” by seeing how similar the memory is to the question. For example, suppose the question is “Where is the milk?” Then intuitively, memories that mention “milk” are probably more useful than memories that mention “soccer ball”.
  • The output module is calculated as the sum of the products of the previous softmax probabilities (p) and a new set of memory embeddings (c). I think of this as where the “actual” memories are stored, while the input module uses its memory embeddings purely for the relevance calculation. Once again, an idea from the paper matches a more formalized concept: the input module is acting as an attention mechanism over the memories (a toy single-hop version in numpy follows at the end of this entry).
  • The answer module is a function of the sum of the output embedding with the question embedding. It’s interesting that the authors decided to sum the embeddings here rather than calculate an inner product. What is the intuition behind this decision?
  • Next, in order to really take advantage of longer term memory, the authors stack multiple input-output modules together, with each one allowing for another “hop” to look back at the sentence embeddings again. For example, for a model that has k=3 hops back, three input-output modules are stacked on top of each other before adding the answer module.
  • Results: MemN2N was one of the most amazing models when it came out; although it produced lower accuracy than the strongly supervised MemNN, it yielded the best scores on bAbI tasks among weakly supervised, end-to-end models.
  • The answers given by the model are often one word, and even when they are two or more words, the model treats them as single phrases. Thus, the model is performing a multi-class classification problem over the vocabulary for BabI tasks. When the authors move onto language modeling, the success metric changes from whether or not the answer was correct, to a measure of perplexity. Specifically, on PTB the perplexity went down to 111 and 147 on Text8 dataset, beating out all other non-attention models of the time.
  • Remarks: Although it has been overtaken by more sophisticated models, it is worth studying to see the basic structure of memory models. It is also interesting to see that many of the “hacks” the authors put in have evolved into standard techniques as the understanding of the architecture has stabilized.
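  • A toy numpy version of a single hop plus answer step, following the description above (sentence embeddings are just summed word vectors here, and the per-hop weight-sharing details are ignored).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memn2n_hop(u, A_sent, C_sent):
    """One hop: p_i = softmax(u . m_i), o = sum_i p_i c_i, next query u' = u + o."""
    m = np.array([A.sum(axis=0) for A in A_sent])   # input memory embeddings (BOW)
    c = np.array([C.sum(axis=0) for C in C_sent])   # output memory embeddings (BOW)
    p = softmax(m @ u)                              # attention over memories
    o = p @ c                                       # weighted sum of output memories
    return u + o

rng = np.random.default_rng(0)
d, n_sent, max_words = 16, 4, 6
A_sent = [rng.normal(size=(max_words, d)) for _ in range(n_sent)]   # word vectors per sentence
C_sent = [rng.normal(size=(max_words, d)) for _ in range(n_sent)]
u = rng.normal(size=d)                              # question embedding
for _ in range(3):                                  # k = 3 stacked hops
    u = memn2n_hop(u, A_sent, C_sent)
W = rng.normal(size=(d, 20))                        # 20 candidate answer words
print("predicted answer index:", int(np.argmax(softmax(u @ W))))
```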

Multi-task Sequence to Sequence Learning (Luong, Le, Sutskever, Vinyals, Kaiser) [Read Oct 13th]

  • Idea: Seq2Seq has recently emerged as a new paradigm, but so far, most applications have focused on only one task. Thus, the authors want to explore multi-task learning (MTL), which can be divided into three settings:
    • One-to-Many – input was English sentences, output was German sentences, syntactically parsed tags, and (unsupervised) English sentences
    • Many-to-One – input was German sentences, Images, and (unsupervised) English, output was English captions (for the images).
    • Many-to-Many – one translation task of German->English, and two autoencoder tasks for German sentences and English sentences.
  • Of course, other MTL setups could have different inputs or outputs; these just happen to be the particular use cases that the paper explored. In fact, the unsupervised tasks were originally trained as autoencoders, and later on as skip-thought vectors. It is worth noting that skip-thought vectors require a corpus of ordered sentences (in order to look at surrounding sentences), but the amount of data available in this form that is also translated (German->English) is limited. As a workaround, the authors split sentences in two and used the first half to predict the second half.
  • When learning, it sounds like the model will randomly alternate between tasks for each mini-batch, choosing a task according to its mixing ratio (a toy scheduler is sketched at the end of this entry). For example, with a mixing ratio of 0.01 for POS tagging and 0.99 for translation, the system will only rarely choose to train on POS. If the main goal is MT (termed the “reference” task), then this lopsided training might be a good idea because even though the POS tagging has poor results, the model has learned “enough” from that task to help with translation while avoiding further mixing that could lead to overfitting. This is like regularization via the secondary task? In any case, other methods for merging the tasks seem to be (1) alternating after each epoch or (2) alternating only after each task converges, which other papers have explored.
  • Results: Training on a small amount of parsing and image caption data can improve translation quality between English and German by 1.5+ BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, they established a new state-of-the-art on constituency parsing with 93.0 F1. Holy smokes, these results were achieved without an attention mechanism!? There is definitely going to be a follow-up paper with everything a couple points higher.
  • Lessons: By training the auto-encoder at the same time as the task, it’s almost like performing pre-training at the same time as training for the end task.
  • For the unsupervised learning objectives, compared to the skip-thought method, autoencoder helps more on BLEU scores over perplexity, which I guess makes sense because BLEU measures overlap of words rather than minimizing complexity.
  • Larger networks matter for sequence models. In particular, the authors believe their non-attention model was competitive with an attention model (3-layer / 256 cell / 512-dim) because it had 4 layers of LSTMs, 1000 cells and 1000-dimension embeddings.
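  • A toy scheduler for the mixing-ratio idea above (my own sketch): each mini-batch picks a task at random in proportion to its ratio.

```python
import random

def sample_task_schedule(mixing_ratios, n_batches, seed=0):
    """Pick which task each mini-batch trains on, in proportion to its mixing ratio."""
    rng = random.Random(seed)
    tasks, weights = zip(*mixing_ratios.items())
    return [rng.choices(tasks, weights=weights)[0] for _ in range(n_batches)]

schedule = sample_task_schedule({"translation": 0.99, "pos_tagging": 0.01}, n_batches=1000)
print(schedule[:5], "...", schedule.count("pos_tagging"), "POS batches out of 1000")
```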

Grammar as a Foreign Language (Vinyals, Kaiser, Terry Koo, Petrov, Sutskever, Hinton) [Read Oct 14th]

  • Idea: Let’s train a Seq2Seq model with an attention mechanism for the task of constituency parsing. From my understanding, POS tagging is predicting a tag (noun, verb, adjective, etc.) for each word. Shallow parsing is predicting part of the syntax tree, such that in addition to the POS tags, certain words can form phrases with associated tags (adjective + noun = noun phrase). Finally, constituency parsing usually refers to predicting the entire syntax parse tree. (If this is wrong, please let me know so I can update.) This is cool because at the time, the attention mechanism was still a relatively new idea.
  • In order to improve accuracy, the authors also generated an artificial dataset of parsed tags by labeling new data where both the Berkeley Parser and ZPar agreed on the tags (11 mil labels), in addition to the human-annotated parse trees (90k labels).
  • Results: F1 scores improve by a couple of points into the ~93 range compared to other networks, even those that are specifically trained for dedicated parsing. The speed of parsing is also faster while simultaneously being more robust to sentence length.
  • Lessons: Deep learning with LSTMs finally beats probabilistic context-free grammars (CFGs)! With a sufficient amount of data, the student can overcome the teacher.

Key-Value Memory Networks for Directly Reading Documents (Miller, Fisch, Dodge, Karimi, Bordes, Weston) [Read Oct 19th]

  • Idea: Question Answering (QA) for information retrieval is useful, and most past attempts attack this problem by setting up a Knowledge Base (KB) that has a pre-defined set of answers, and then pulling the right answer based on the query. Even with perfect understanding of the question, a system might still fail because the appropriate answer just isn’t in the KB. Retrieving answers directly from text is no walk in the park either, since the information is unstructured, ambiguous and disorganized. So despite their drawbacks, KBs are still the way to go for most closed-domain searches. One step toward reading documents directly is using Information Extraction (IE) tools that try to pull out relevant sentences to be used in a KB.
  • To further bridge this gap, the authors propose the Key-Value Memory Network (KV-MemNN). In a MemNN, the memory was often just 5-7 sentences that needed to be stored, so the “attention” portion only has to pick from 5-7 embeddings. In free-form reading, however, we might see documents with 50+ sentences each and 17,000 such documents (as is the case in the movie QA dataset the team created). To bridge this huge gap, the network stores each “memory” as a key-value pair, such that the key is specifically designed to match against a question, and the value part is what is actually used to answer the query later on.
  • Each KV memory can be composed in many ways:
    • KB Triple: Key=subject, Value=object, from a knowledge based “subject relation object” triple.
    • Sentence Level: Key = Value = BOW of each sentence. This isn’t particularly useful, but serves as a good baseline.
    • Window Level: Key = window of X words, Value = the center word within the window
    • Window + Center Encoding: Key = window of X words, Value = the center word as an encoding of its surrounding words, rather than just itself
    • Window + Title: Key = window of X words, Value = the title of the movie. This memory is used in conjunction with the other memories above since it is weak by itself. This is pretty blatant feature engineering.
  • The general process follows three steps (a toy end-to-end pass is sketched at the end of this entry):
    • 1) Key Hashing: Only return memories where at least one word in the question matches one word in the key (ignoring stop-words). This is pretty much the only real change to the original MemNN, so that the search space for the memory network reaches a tractable size.
    • 2) Key Addressing: evaluate all candidate keys using some feature maps. This is the part most similar to the original MemNN. This step is performed repeatedly until it reaches a certain pre-defined number of hops.
    • 3) Value Reading: Take the weighted sum of all the relevant memories and return those to answer the question.
  • Results: Team tested on a movie database, asking questions like “What movie did Harrison Ford star in?” or “What movies are top rated comedies?”. When the information source is better (ie. coming from a KB rather than a free-form document), the KV-MemNN performs better (93.9% vs. 76.2%). This is not too surprising because a KB offers much more structured information. In a realistic scenario, anyone would use a KB for QA, but the point is that in most domains, a KB is not available. It should be noted that using IE as the data source performed the worst at 68.3% accuracy. Overall, KV-MemNN is better than other tools (MemNN, Supervised Embeddings) to find the right answer out of a candidate pool of answers.
  • Lessons: The model’s abilities are frankly a bit disappointing because ultimately it just picks the right answer from a pre-determined set, rather than generating a novel answer based on the text it has read. Even the number of hops is hardcoded rather than dynamically determined. Despite all the fanfare in the introduction, this is simply a sophisticated pipeline component, rather than an actual innovative free text reader.
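  • A toy end-to-end pass over the three steps above (hashing, addressing, reading). Everything here - the stop-word list, the summed-word-vector “embedding”, and the example memories - is made up for illustration, and the final step of scoring candidate answers against the resulting query vector is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

STOP = {"what", "did", "in", "the", "a", "is", "are"}

def kv_memnn_pass(question, memories, embed, hops=2):
    """Toy KV-MemNN pass: (1) key hashing keeps only memories sharing a
    non-stopword with the question, (2) key addressing scores the surviving
    keys against the query, (3) value reading adds the weighted value sum."""
    q_words = {w for w in question.lower().split() if w not in STOP}
    candidates = [(k, v) for k, v in memories
                  if q_words & set(k.lower().split())]        # 1) key hashing
    q = embed(question)
    keys = np.array([embed(k) for k, _ in candidates])
    vals = np.array([embed(v) for _, v in candidates])
    for _ in range(hops):
        p = softmax(keys @ q)                                 # 2) key addressing
        q = q + p @ vals                                      # 3) value reading
    return q, [k for k, _ in candidates]

rng = np.random.default_rng(0)
word_vecs = {}
def embed(text):
    """Summed random word vectors; a stand-in for a learned embedding."""
    return sum(word_vecs.setdefault(w, rng.normal(size=8)) for w in text.lower().split())

memories = [("Harrison Ford starred in Blade Runner", "Blade Runner"),
            ("Ridley Scott directed Blade Runner", "Ridley Scott")]
q_vec, kept = kv_memnn_pass("What did Harrison Ford star in", memories, embed)
print("memories kept after hashing:", kept)
```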

Generative Models (GANs, VAEs, Word2Vec, CBOW)

Generating vector embeddings for words and sentences, mainly through unsupervised methods.

Skip-Thought Vectors (Kiros, Zhu, Salakhutdinov, Zemel, Torralba, Urtasun, Fidler) [Read Oct 17th]

  • Idea: Word2Vec has seen tremendous success in encoding meaning into words with the famous King - male + female = Queen example. If skip-grams work well for training words, would the same technique work well for creating sentence-level embeddings? It turns out that this works decently well when combined with a couple of other tricks. Formally, the skip-thought model is a GRU encoder-decoder where the inputs are the words in the current sentence, and the correct outputs are the words in the previous sentence and next sentence. The encoder uses a single-stack bi-directional GRU, similar to bi-LSTMs, where inputs are fed forward and backward, and the final encoding is the concat (rather than a sum) of the two passes. The objective is to minimize the cost of predicting the next two words (one for the sentence before and one for the sentence after) given all the previously predicted words and the hidden state of the GRU. Also, we take a log before summing the values together, because log is monotonic, so minimizing log P(current word | past words) is the same as minimizing P(current word | past words).
  • In order to handle OOV (out-of-vocabulary) words, the model is cross-trained using data from word2vec such that each word in the dataset is mapped to a semantically similar space to words found in word2vec. The assumption is that word2vec has a significantly larger vocabulary |V|. Then, during inference, when a new word is encountered, that word is “looked up” in the word2vec space and replaced with a similar word that was found in the corpus. This pushes the 20k unique vocabulary (from the BookCorpus dataset) up to 931k possible words.
  • Results: In order to test the skip-thought (ST) vectors:
    • Freeze the weights in the encoder portion, and drop the decoder.
    • Tokenize your new sentence, replacing any OOV words with something similar from word2vec. Feed in your test sentence one word for each time step. This will produce an encoded sentence embedding, which we denote (u).
    • Then, to calculate similarity between two sentences, start by creating two features (or more) from the sentence embeddings. Assume the other sentence we are comparing to is called (w).
    • The first feature is the component-wise product (u · w). The second feature is the absolute difference |u - w|. Sometimes, add some other features such as statistical relationships or POS tags (a toy feature builder is sketched at the end of this entry).
    • Finally, join those two features together and feed in as the input into a classifier. The authors deliberately chose linear classifiers for this portion to keep things simple.
  • This type of set-up was used on three tasks:
    • Semantic-relatedness: Pick two sentences and see how similar they are in meaning. If they are similar, give the pair a 5, if they are different, give the pair a 1. The ST model was able to accurately predict scores from 1-5 for pairs of sentences better than all rules-based models, and close to other LSTM models trained specifically for this task (SICK).
    • Paraphrase detection: Pick two sentences and decide if they are paraphrases of each other. I guess this is similar to the previous task except instead of picking 1 through 5, you just pick 1 or 5, turning this into a binary classification task. STs are close to, but fail to beat, other RNNs trained specifically for this task.
    • Image-sentence ranking: Given a sentence, find the image that is associated with it. First, pass the sentence through Skip-thought model to get a sentence embedding. Then the image embeddings are the outputs from a 19-layer 4096-dim OxfordNet (i.e. ST doesn’t magically also process images). Then we look for the closest image based on cosine distance. Again, ST performs on par with, but does not surpass other LSTM methods.
  • Finally, STs were also tested on 5 other standard classification benchmarks - movie review sentiment (MR), customer product reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA) and question-type classification (TREC). In all cases a simple logistic regression classifier was used (rather than some non-linear NN). Also, in all cases ST by itself wasn't quite as good as other models. By adding Naive Bayes features, ST+NB performs on par with others (the AdaSent model wins overall).
  • Lesson: Since ST is a generative model, it can be used to write stories (similar to the Char-RNN from Karpathy). Although not particularly useful, this is always fun :)

Learning Distributed Representations of Sentences from Unlabelled Data (Hill, Cho, Korhonen) [Read: Sept 19th]

  • Idea: Sequential Denoising Autoencoders (SDAEs) and FastSent are proposed as two models that can be used to encode phrases (as opposed to words).
  • SDAEs work similarly to DAEs for images: a slightly corrupted input is used to predict the original data as the target output through an encoder-decoder LSTM network. For SDAEs, the noise in the corrupted sentence is generated by a combination of random (a) swapping of words or (b) dropping of words from the original sentence. The “Sequential” part comes in because the model encodes sequences of words, rather than a block of pixels.
  • FastSent works similarly to SkipThought vectors in that the source sentence is used to predict the surrounding sentences (before and after). It is much faster because rather than feeding in one word vector at a time, FastSent just sums the word vectors in the source sentence (u) to predict the words in the two target sentences (v).
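A toy numpy sketch of the FastSent objective as I read it - the sentence representation is just the sum of its word vectors, scored against every word in the neighbouring sentences (a full softmax is used here for clarity; names are my own):

```python
import numpy as np

def fastsent_loss(src_ids, ctx_ids, W_in, W_out):
    # src_ids: word indices of the current sentence.
    # ctx_ids: word indices of the previous and next sentences.
    # W_in / W_out: input and output embedding matrices, both (V, d).
    s = W_in[src_ids].sum(axis=0)          # sentence vector = sum of word vectors
    logits = W_out @ s                     # score every vocabulary word
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[ctx_ids].sum()       # negative log-likelihood of context words
```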
  • Results: Among the different sentence models, the deeper or more complex models (SkipThought, SDAE and NMT), which require greater time and resources to train, generally performed best when applied to supervised tasks. On the other hand, shallow log-linear models (CBOW, FastSent) performed best on unsupervised benchmarks, because these tasks mostly compare distances between pairs of similar sentences.
  • Across both unsupervised and supervised evaluations, the BOW DictRep with pre-trained word embeddings exhibits the most consistent performance by a noticeable margin, which suggests that when the ultimate application is unknown (and thus no specific model can be prescribed), DictRep is a great default model choice.
  • Lessons: Adding noise (or dropout) to models allows them to perform more robustly as seen in the higher performance of SDAEs over SAEs.
  • Strong performance from FastSent and SkipThought provides strong evidence for a sentence-level distributional hypothesis, namely that the context in which a sentence occurs provides valuable information about its semantics.

Sense2Vec: Word Sense Disambiguation (Trask, Michalak, Liu) [Read: Sept 18th]

  • Idea: Words often have multiple meanings, so modeling techniques (e.g. word2vec) that try to encode all of those meanings (or senses) into a single vector miss out on the context-specific meaning of a word, which understandably has a negative effect on the eventual downstream supervised task. Previous approaches to disambiguation generally perform a clustering step first to find the number of senses for each term, and then run another network to generate the actual vectors, one vector per cluster. The authors' method leverages supervised NLP labels (such as POS tags) instead of unsupervised clusters to determine the sense of any given word.
  • Results: Their model is roughly 1-2% better than wang2vec. This is a bit underwhelming given the much harder-to-obtain labeled data required to run sense2vec. The authors also completely fail to deal with how terms should be disambiguated during actual live usage versus training time.

A Neural Conversational Model (Vinyals, Le) [Read Oct 16th]

  • Idea: Seq2Seq is awesome, so let’s use it to create a chatbot. The trick is to have unheard-of amounts of data that only Google has the capacity to process. Up till now, most conversational AIs have been built using mostly hand-crafted tools on very specific domains, such as booking an airline ticket or sending customers around in circles to prevent them from getting help with their internet service. In this instance, the authors still used two different models with different hyperparams, but the general architecture of the system remained the same.
  • Results: The results are quite impressive since the model is able to create coherent, multi-response conversations for a broad range of tasks, despite somewhat noisy input data.
    • Closed-domain (IT helpdesk troubleshooting) - The model was trained using a single-layer LSTM with 1024 memory units. The input was a customer talking and the correct output is what the tech specialist said in response. The model was able to produce perplexity of 8 vs 18 using KN smoothed n-grams. The model is actually able to solve realistic IT issues. Read the paper for example dialogue.
    • Open-domain (movie transcript dataset) - The model was trained using a two-layer LSTM with 4096 memory units. The input was any (pre-processed) sentence in a movie and the correct output is whatever the next sentence was in that movie script. This implies that in cases where a single person speaks two times in a row, the input and output aren’t really accurate - leading to noisy data. Surprisingly, the model is able to answer philosophical questions (e.g. What is morality?) and other queries not explicitly found in the movie database. The model lacks a consistent personality and sometimes the answers don’t make a lot of sense, but the fact that anything meaningful came out at all is very impressive.
  • Lessons: An attention mechanism was tested on the movie dataset, but did not yield any meaningful improvements. My guess is that this is a matter of fine-tuning and would actually make a difference if done a certain way.
  • Based on the output from other SOTA chatbots, it is clear that most other systems have hand-crafted rules because the answers are often semantically non-sensical (but syntactically correct) or exactly on point, which suggests that the latter is where the developer was able to create a rule, and the former is just an uncovered domain.

Alternative Optimization

Binarized Neural Networks (Courbariaux, Hubara, Soudry, El-Yaniv, Bengio) [Read Sept 17th]

  • Idea: The matrix multiplications used in computing activations and gradients can be quite expensive, and the memory used to store all the weights can also get expensive. We can lower this cost by reducing the weights from 32-bit floats to 8 bits without losing much final accuracy. In fact, we can take this further to 2 bits, and even 1 bit, so that each weight is just +1 or -1 (a tiny sketch of the binarization follows this entry). This has the added benefit of simplifying arithmetic, so that most calculations become bitwise operations, which have significantly lower memory access costs.
  • While parts of this idea have been implemented in the past, no prior work had succeeded in binarizing both the weights and the neurons, at the inference phase and during the entire training phase of a deep network.
  • Results: A specialized GPU kernel was able to achieve a speedup by a factor of 7, while still achieving near state-of-the-art results on multiple datasets (MNIST, CIFAR-10 and SVHN), on the order of 5% worse relative ( < 1% worse in absolute terms). A binarized CNN can also lead to repeated binary convolution kernels, and the authors argue that dedicated hardware could reduce the time complexity by 60%.
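A minimal sketch of the core trick, assuming deterministic binarization with a straight-through gradient estimator (the paper also uses stochastic binarization and several other details this glosses over):

```python
import numpy as np

def binarize(w):
    # Forward pass: collapse real-valued weights to +1 / -1.
    return np.where(w >= 0, 1.0, -1.0)

def straight_through_grad(w, grad_wrt_binary):
    # Backward pass: treat sign() as the identity, but cancel the gradient
    # where the underlying real-valued weight has already saturated.
    return grad_wrt_binary * (np.abs(w) <= 1.0)
```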

Building Machines That Learn and Think Like People (Lake, Ullman, Tenenbaum, Gershman) [Read Sept 29th]

  • Idea: AI has made a lot of advances in the last couple of years due to deep learning, but we are still missing some critical pieces. In particular, real machines should be able to (a) understand causal relationships rather than just perform pattern recognition, (b) have a solid grounding in physics, psychology and other contexts, and (c) have the ability to learn new things on their own (and to do it quickly) by recombining existing representations.
  • Promising paths include attention models, memory networks, and reinforcement learning with experience replay.
  • Lessons: This paper is unlike the others because it does not present a new idea with results trying to beat some benchmark, but rather is a call-to-arms to researchers to build general AI. I include this article because it is precisely what I believe is not necessary for a technology revolution in deep learning.
  • While the goal of a robot that thinks and acts like a human is a noble one, I believe we are decades away from this. More importantly, I believe that we do not need to achieve this goal in order for machine learning to make a large and meaningful impact in society.
  • My rallying cry is to think of what types of use cases are already enabled by deep learning because right now the general audience has heard so much promise about AI, and yet seen very little benefit. If that continues to be the case, then their anxious anticipation will turn into cynicism, causing the AI community to go through another winter. On the other hand, if we can temper the hype while producing something immediately impactful, however small, then everyone wins.

A Hierarchical Neural Autoencoder for Paragraphs and Documents (Li, Luong, Jurafsky) [Read: Oct 7th]

  • Idea: Long-term generation of text is hard for RNNs, especially when it gets to paragraph length (~120 words) rather than sentence length (~15 words). The authors introduce an LSTM auto-encoder that hierarchically builds an embedding for a paragraph from embeddings for sentences and words, then decodes this embedding to reconstruct the original paragraph.
  • First, note that a document is built up of many sentences, sentences are made of words, and words are made of characters. In this case, during each time step of the RNN a word is fed in, and the goal is to recreate a full document. The hierarchy idea is that, while building the document embedding in the middle of the Seq2Seq model, whenever an EOS token is reached the current state is passed up as a sentence embedding to the document-level encoder, rather than waiting for an EOD token at the very end.
  • Example: Let’s say that there are 3 sentences with 5, 4, and 6 words in them respectively. Rather than getting one entry at the end into the document encoding, we have 3 sentence vectors feeding into the document encoding. Additionally, rather than running attention on 15 words, we look back only on the 3 sentences.
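A rough sketch of that hierarchy, with plain tanh RNN cells standing in for the paper's LSTMs to keep it short (all names and shapes are my own):

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    return np.tanh(Wx @ x + Wh @ h + b)

def hierarchical_encode(sentences, params_word, params_sent, d):
    # sentences: list of sentences, each a list of word vectors.
    sent_vecs = []
    for sent in sentences:
        h = np.zeros(d)
        for x in sent:                 # one word per time step
            h = rnn_step(x, h, *params_word)
        sent_vecs.append(h)            # state at EOS becomes the sentence embedding
    g = np.zeros(d)
    for s in sent_vecs:                # sentence embeddings feed a second RNN
        g = rnn_step(s, g, *params_sent)
    return g                           # document embedding
```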
  • Results: One of the goals is to improve local compositionality, which is to compose a sentence so that the neighboring (local) words have valid syntactic and semantic meaning. Recall that syntax is the grammar of a sentence, and semantics is the meaning of the sentence. For example, “The milk is triangular” is syntactically valid since it follows the allowed form “noun verb adjective”. However, the sentence makes no semantic sense. Another goal was discourse coherence, where the output preserves not only most of the words, but also the ordering of the words and phrases within a document.
  • Lessons: As the authors acknowledge, an auto-encoder task is not very exciting, but it does prove the point that longer-length document generation is possible. More meaningful applications of the technique are left to the reader.

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec (Christopher Moody) [Read Oct 12th]

  • Idea: Word embeddings (aka distributed dense representations) have been shown to be effective at capturing semantic and syntactic information, as shown by the famous King-Male+Female=Queen equation. Separately, topic models have been shown to be effective at offering interpretable summaries of documents. The author builds on this by creating a model that learns distributed word vectors (through skip-gram negative sampling) jointly with topic vectors (through Dirichlet-distributed, latent document-level mixtures). In contrast to the continuous, real-valued dense document representations produced by typical embedding models, the lda2vec document weights are sparse, positive values that add up to 100%. Whereas the former gives a result that can be seen as an address in a (100+ dimensional) latent variable space where the document’s topics are clustered together, the latter gives the portion of the document that is attributable to its major topics (e.g. 40% sports, athletics; 55% salaries, negotiations; and 5% lawsuits, legal). A toy sketch of this mixture appears after the next bullet.
  • Not only is this model easier to interpret, the ability to look back at the associated word embeddings gives another way to figure out what is going on in your dataset. Just like normal topic modeling, the results become really interesting when you start to graph trends over time (how popular a particular product is from one year to the next), or what topics are closely related to each other (what products are often bought together).
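Here is the mixture idea in a few lines of numpy, as I understand it (the Dirichlet prior that pushes the proportions toward sparsity is omitted, and all names are my own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lda2vec_context(word_vec, doc_weights, topic_matrix):
    # doc_weights: unnormalized per-document topic weights, shape (n_topics,)
    # topic_matrix: one embedding per topic, shape (n_topics, dim)
    proportions = softmax(doc_weights)       # interpretable, sums to 100%
    doc_vec = proportions @ topic_matrix     # document vector = mixture of topics
    return word_vec + doc_vec, proportions   # context used to predict nearby words
```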
  • Results: The model gets benefits from both worlds. Trained on a Hacker News dataset, on the word embedding side, the model learns relationships like:
    • California + technology = Silicon Valley
    • digital + currency = Bitcoin
    • Javascript - browser + server = Node.js
    • Mark Zuckerberg - Facebook + Amazon = Jeff Bezos
  • On the topic modeling side, the model learns that these words belong to the following topics:
    • Internet Portals: DDG, Bing, Google+, iGoogle (for those who don’t know, DDG stands for DuckDuckGo)
    • Bitcoin: btc, bitcoins, Mt. Gox, MtGox
    • Gadget Hardware: the Surface Pro, HDMI, glossy screens, Mac Pro, Thunderbolt
  • Lesson: This works surprisingly well, and can be used for EDA (exploratory data analysis). Also, SpaCy is awesome for tons of pre-processing!

Smart Reply: Automated Response Suggestion for Email (Kannan, Kurach, Ravi, Kaufmann, Tomkins, Miklos, Corrado, et al.) [Read Oct 15th]

  • Idea: Seq2Seq has been incredibly powerful for generating chat responses as seen in Neural Conversation Model, so Google Brain had the idea of actually putting this in production! Smart Reply automatically generates semantically diverse text suggestions that can be used as complete email responses with just one tap on a mobile device. The system is currently used in Inbox by Gmail and is responsible for assisting with 10% of all mobile responses. An initial study covering millions of examples showed that around a quarter of email responses were 20 tokens or less. I love this extra bit of user needs validation because as cool as it is to use LSTMs, great products need to solve legitimate problems too.
  • Process: Before any messages come in, decide on a set of predefined responses. First, clear out obviously poor responses, such as profanity, misspellings, or non-English phrases. Then, cluster actual user-generated content into semantically similar buckets (they chose 100 buckets). For each bucket, label a small handful (3-5) responses, turning the overall process into a semi-supervised learning problem. This helps scalability because the scoring model now chooses from 100 canonical responses rather than ~5 million. During inference time:
    1. Pre-process the incoming email to strip out headers and PII. Tokenize, segment into sentences, and remove previously quoted emails.
    2. Pass through a binary classifier to decide whether or not to trigger suggested responses from Smart Reply. This improves scalability since the full scoring system is not run for most queries.
    3. Pick the top-rated possible responses using an Encoder-Decoder LSTM network (no mention of an attention mechanism?). The training data is 238 million messages, which include 153 million messages that have no response. The output of the decoder is a softmax over every word in the vocabulary, which can be used in one of three ways:
      • Pick a word given its likelihood, feed back into the model and keep sampling until you get < EOS> token.
      • Pick the top k words and feed them into a beam search that retains the top k prefixes, repeating until you get an < EOS > token. If k is set to 1, this devolves into a greedy search where just the top word is chosen each time.
      • For each response from a small pre-defined set, score each word in the response and sum together to get a total score. Pick the response with the highest total score.
    4. (Actually 3b): In more detail, the authors chose option 3, where they score each response. This could be accomplished through batch processing, but since the system must score emails in real time, offline jobs are not an option. In order to avoid a potentially devastating bottleneck in finding the top response from the set R, the team scores things differently. First, the elements of R are organized into a trie. Then, they conduct a left-to-right beam search, but only retain hypotheses that appear in the trie, so the computational complexity becomes a function of the number of words in a response, rather than a function of the vocabulary size (a small sketch of this trie-restricted search follows the list). Now we have the highest-scored response cluster, chosen out of ~400 options (my math is 100 clusters x 3-5 labels per cluster), and we must choose the best response within that cluster.
    5. Pass through other algorithms to produce response diversity:
      • Avoid redundancy: “Sure, I’ll be there.” and “Ok, I’ll come.” are too similar
      • Penalize common (and thus high-scoring) but generic responses, such as “Yes!” or “OK”, by adding regularization.
      • Focus on phrases that show clear intent: “Yes!” vs “Yes, let’s meet up.” With the former, it is unclear what the user is agreeing to.
      • Enforce positive and negative responses: If the top two responses contain at least one positive response and none of the top three responses are negative, the third response is replaced with a negative one (and vice versa).
    6. Send the top three responses to a web server, which delivers them to the mobile device.
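A small sketch of the trie-restricted beam search from step 4, where `score(prefix, token)` is a stand-in for the decoder LSTM's log-probability (all names are my own):

```python
def build_trie(responses):
    root = {}
    for resp in responses:
        node = root
        for tok in resp.split() + ["<EOS>"]:
            node = node.setdefault(tok, {})
    return root

def trie_beam_search(root, score, beam=10):
    # Only prefixes that exist in the trie survive, so the cost scales with
    # response length rather than vocabulary size.
    beams = [((), 0.0, root)]
    finished = []
    while beams:
        candidates = []
        for prefix, logp, node in beams:
            for tok, child in node.items():
                cand = (prefix + (tok,), logp + score(prefix, tok), child)
                (finished if tok == "<EOS>" else candidates).append(cand)
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam]
    return max(finished, key=lambda c: c[1])[0] if finished else ()
```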
  • Results: The authors checked for Perplexity, Mean Reciprocal Rank and Precision@K, but let’s not look at datasets or benchmarks. You can download Inbox on your phone and use it today :)


Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations (Krueger, Maharaj, Kramár, Pezeshki, Ballas, Ke, et al.) [Read Oct 18th]

  • Idea: Dropout works well for CNNs and feed-forward networks, but can be difficult to apply to RNNs. In particular, the goal of LSTMs is to preserve information over long time spans, so if that information is randomly dropped, the negative consequences quickly start to outweigh the regularization benefits. Some people have tried limiting dropout to vertical RNN layers (stacked RNNs), or to certain gates (e.g. the input gate of an LSTM), to allow more information to propagate through. Rather than setting some units’ activations to 0 (which is what dropout does), the paper proposes Zoneout, which sets the new hidden state h_t of a unit to its hidden state h_{t-1} from the previous time step, as if nothing had changed. This is like performing an identity operation on units rather than a null operation (dropping them).
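The per-unit update is tiny; a sketch, assuming the usual convention of mixing deterministically at test time (names are my own):

```python
import numpy as np

def zoneout(h_prev, h_candidate, rate, training=True):
    # With probability `rate`, a unit keeps ("zones out" to) its previous
    # value instead of taking the freshly computed one.
    if training:
        keep_prev = (np.random.rand(*h_prev.shape) < rate).astype(h_prev.dtype)
        return keep_prev * h_prev + (1.0 - keep_prev) * h_candidate
    return rate * h_prev + (1.0 - rate) * h_candidate
```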
  • Compared with dropout, Zoneout is superior because it allows more information to flow forward and backward. Compared to Recurrent Dropout without Memory Loss, Zoneout passes the hidden state and memory cell unimpeded, while the other keeps the memory cell but stochastically drops the hidden state. Not only does Zoneout keep more information, but it is more generalizable (to units without gates) and just seems cleaner to implement. As the authors note, “zoning out a unit in an RNN preserves not only its activation, but its meaning as well.” Compared with Stochastic Depth networks from [Huang et al., 2016], Zoneout can be viewed as a per-unit version of stochastic depth. (Applying the best ideas from CNN land in a recurrent setting, how awesome is that!?)
  • Results: Zoneout yields state-of-the-art results in character-level LM on PTB, and close to state-of-the-art word-level PTB perplexity. It also did quite well on permuted sequential MNIST classification. Given the simplicity of the implementation compared to other RNN regularizers, Zoneout is a pretty strong winner.
  • Lesson: You can’t apply zoneout to an entire layer, like with stochastic depth networks because in a typical RNN task, “there is a new input at each time-step”. For example, if we were translating a sentence, serious issues would arise if the word that was randomly “dropped” happened to be critical to comprehending the sentence’s meaning.

On Multiplicative Integration with Recurrent Neural Networks (Wu, S. Zhang, Y. Zhang, Bengio, Salakhutdinov) [Read Oct 20th]

  • Idea: RNNs are used in a wide variety of language modeling and NLP tasks using the standard formula of φ(Wx + Uz + b), where x is the input at each time step, z is the hidden state being passed over, b is the bias term, and φ is some sort of non-linearity function. Additionally W and U are the associated weight matrices. The relationship between the two inputs is an additive one, however the authors discovered that using the Hadamard product “ʘ” as the relationship can yield great benefits. This yields the equation φ(Wx ʘ Uz + b), which is termed the Multiplicative Integration (MI).
  • After a bit of manipulation, and adding gating vectors α, β1 and β2, we get the general formula: φ((α ʘ Wx ʘ Uz) + (β1 ʘ Uz) + (β2 ʘ Wx) + b). The key idea is that “Wx” and “Uz” are multiplied together rather than added, which gives the benefit that they act as gating functions for each other.
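The whole cell fits in a couple of lines; a numpy sketch of the general MI step (parameter shapes are my assumption):

```python
import numpy as np

def mi_rnn_step(x, h, W, U, b, alpha, beta1, beta2, phi=np.tanh):
    # phi(alpha*(Wx ʘ Uh) + beta1*Uh + beta2*Wx + b).
    # alpha=1, beta1=beta2=0 recovers the pure Hadamard form;
    # alpha=0, beta1=beta2=1 recovers the ordinary additive RNN.
    wx, uh = W @ x, U @ h
    return phi(alpha * wx * uh + beta1 * uh + beta2 * wx + b)
```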
  • The biggest gain comes when examining the gradient of this similar, but different function. For the standard additive form, the gradient with respect to an earlier hidden state is ∂h_n/∂h_1 = ∏_{k=2}^{n} Uᵀ diag(φ′_k), where φ′_k = φ′(Wx_k + Uh_{k−1} + b). Note that the gradient flowing back is dominated by the U-transpose matrix associated with the hidden state. Using the new formulation, the gradient w.r.t. the hidden state becomes ∂h_n/∂h_1 = ∏_{k=2}^{n} Uᵀ diag(Wx_k) diag(φ′_k), with φ′_k now evaluated at Wx_k ʘ Uh_{k−1} + b.
  • Notice that now the W-matrix has a prominent role outside of the φ’ function. With Wx_k directly part of the gating mechanism, the vanishing/exploding gradient problem is alleviated.
  • Results: Tested on the PTB, text8, and Hutter Prize Wikipedia datasets for character-level language modeling, the MI-RNN has lower bits-per-character (BPC) than other models. Tested on the WSJ corpus for speech recognition, the MI-RNN performs at or near the state-of-the-art. Next, when trained on the BookCorpus dataset for generating Skip-thought vectors, the MI-RNN mostly does better than anything else out there. Finally, when applied to the CNN dataset with an Attentive Reader model, the MI mechanism did noticeably better on validation accuracy than other models.
  • Lesson: One of the key bonuses of this methodology is that it is readily applied to LSTMs and GRUs in addition to vanilla RNNs. Additionally, the number of new parameters (α, β1, β2) is relatively negligible compared to the size of the weight matrix.

sk_p: a neural program corrector for MOOCs (Pu, Narasimhan, Solar-Lezama, Barzilay) [Read Oct 23rd]

  • Idea - MIT students apply a Seq2Seq LSTM model to the task of correcting code snippets from programming assignments given in MOOCs. Concretely, given an input of an incorrect program, the model attempts to generate a correct program. If one is found, then the solution and the steps required to reach that solution can be given back to the student to show how her program could have been improved.
  • The high-level process is as follows: take in incorrect programming assignments, and replace variable names with more generic placeholders (plus other basic preprocessing). Given a program of 10 lines, tokenize it into 10 lines and add start-of-program and end-of-program tokens. For each training example, split the program up into “trigrams” of lines. For example, the first input would be the start-of-program token and the 2nd line, and the correct output would be the 1st line of the program. The second input would be the 1st and 3rd lines, and the correct output would be the 2nd line. This pattern continues until the final input is the 9th line and the end-of-program token, with the correct output as the 10th line of the program. Although the paper states their system is heavily inspired by Skipthoughts and Skipgrams (going so far as to name their system the “sk_p” model), I believe this actually works closer to the CBOW architecture found in word2vec.
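A sketch of how those training pairs might be constructed (token names are my own):

```python
def make_training_pairs(program_lines):
    # The two surrounding lines are the input; the middle line is the target.
    padded = ["<sop>"] + program_lines + ["<eop>"]
    pairs = []
    for i in range(1, len(padded) - 1):
        pairs.append(((padded[i - 1], padded[i + 1]), padded[i]))
    return pairs

# e.g. make_training_pairs(["x = 0", "for i in range(n):", "    x += i", "return x"])
```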
  • During training, the correct program is given as the output. During inference, beam search is used to generate possible programs. So for our 10 line program, maybe the top 5 generated statements per line are returned (for a total of 50 statements), and then beam search is used to determine the highest probability program until an end-of-program token is reached. This program is then evaluated against a unit-test suite to see if it can be considered a correct program. I am assuming the benefit of doing so is that the program can possibly generate novel programs that are (a) similar to the student’s program (b) pass the test suite but (c) do not match the canonical correct answer, otherwise the student could just look at the correct answer to see what they should do.
  • Results - The model is able to correct 29% of cases automatically, which seems low, but is actually better than most autograders, even those programmed specifically for this task. I hesitate to say “trained” for the task, since most autograders do not deploy machine learning. In an exhaustive version of the sk_p model where the system simply memorizes the correct snippets, the model is able to achieve 35% test accuracy.
  • Remarks - Although the paper heavily emphasizes the superiority of their model due to its generative (rather than discriminative) nature, the final model still seems very targeted to the specific task at hand. In other words, the model has not gained some internal understanding of programming or Python, but instead has just learned a statistically superior method of generating syntactically correct programs. In some ways the model is quite amazing for learning the common syntax errors made by beginner programmers, rather than having those patterns hard-coded. However, this isn’t nearly the same as actually understanding why the programmer made the mistakes she did.
  • Ultimately though, this is an actual application of deep learning and it was done by students who probably had midterms and other obligations to balance at the same time. It’s certainly more impressive than anything I can throw together, so I only have hearty approbation to give.

Bag of Tricks for Efficient Text Classification (Joulin, Grave, Bojanowski, Mikolov) [Read Oct 24th]

  • Motivation: Text classification is a pervasive and impactful task that all large tech companies face, with spam detection and clickbait filtering being some common examples. While deep learning has certainly pushed the envelope in terms of improving accuracy, the best models universally take a long time to train and run. When racing to compete in start-up land though, it’s oftentimes OK to drop a few percent in accuracy for a gain in efficiency, especially if the speed-up is substantial.
  • Results: With this in mind, the authors present fastText, a shallow neural network that can be trained on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 300K+ classes in less than a minute. For comparison, the deep learning models with similar accuracy results might take more than a day to train on a GPU, and over an hour to classify the same number of sentences. The authors tested this on sentiment analysis and tag classification tasks and saw results of +/- 1% on many of the datasets, sometimes even doing better than certain deep learning models.
  • Idea: So what is in their “bag of tricks”? To start, this is not a deep network with LSTMs or convolution kernels or some external memory. It’s just a shallow neural network with a softmax at the end. Some other ideas to make it go fast:
    1. Rather than a plain softmax over the 312K classes, use a hierarchical softmax based on a Huffman encoding
    2. Prune leaves using depth-first search when the probability of a set of classes becomes too low
    3. Structure the tree as a binary heap such that classes that are more likely to be predicted get placed higher up in the tree.
    4. Use a hashing trick to select inputs when forming n-gram features (a toy sketch follows this list).
    5. Use multiple CPUs rather than just one.
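A toy forward pass showing the hashed bag-of-n-grams idea, with a plain softmax standing in for the hierarchical one (all names and sizes are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ngram_ids(tokens, n=2, buckets=2_000_000):
    # The hashing trick: n-grams map straight to rows of the embedding
    # table, so no n-gram vocabulary has to be stored.
    ids = [hash(t) % buckets for t in tokens]
    ids += [hash(" ".join(tokens[i:i + n])) % buckets
            for i in range(len(tokens) - n + 1)]
    return ids

def fasttext_predict(tokens, E, W):
    # E: (buckets, dim) embedding table; W: (n_classes, dim) output layer.
    h = E[ngram_ids(tokens)].mean(axis=0)   # sentence vector = average of features
    return softmax(W @ h)
```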
  • These explain how to go fast, but how is it still so accurate?
    1. A shallow neural net is still more powerful than an SVM or other linear classifier since information is shared via the hidden layers.
    2. Rather than a pure BOW model, use a bag of N-grams model such that some positional information is still passed along
    3. Occasionally boosting the features, such as using trigrams instead of bigrams, can help.
    4. The network is actually not that powerful, but the tasks (sentiment analysis, spam vs. not spam) are so simple that the extra horsepower coming from deep nets becomes redundant. If I’m just trying to cross the street (rather than run a 5k), then a car will surely still be faster than walking, but only by so much.
  • Remarks: fastText seems like an awesome tool to use for production environments. Even for research purposes, it should serve as a quick and easy baseline to compare other methods against. Did I mention the code was open-source? Yay!
  • Note that this paper focuses mainly on supervised classification. The same model can also be used for unsupervised training of word embeddings. A summary of that paper by the same authors can be found below.

Enriching Word Vectors with Subword Information (Bojanowski, Grave, Joulin, Mikolov) [Read Oct 25th]

  • Idea: Distributed word representations created through tools such as word2vec are quite powerful for many downstream NLP tasks, but often train quite slowly. Additionally, they ignore sub-word information, such as morphemes or characters, by assigning a distinct vector to each word. The authors come up with a slight variation on the skip-gram model, where each word is represented as a bag of character n-grams.
  • Concretely, each word is represented as a bag of character n-grams: the word is wrapped in boundary symbols < and >, every character n-gram of the wrapped word is collected, and the wrapped word itself is added as one extra special sequence. For example, for the word “squeezed” with n = 3, the n-grams are:
    • <sq, squ, que, uee, eez, eze, zed, ed>
    • plus the special sequence <squeezed>
  • In order to create a word vector, the vectors of all of its n-grams are summed together; this composed vector is then used in the usual skip-gram objective of predicting the surrounding context words. In practice, n ranges from 3 to 6. The cost function is negative log likelihood with negative sampling. This is powerful because (a) it helps with OOV words, since a vector can be built for any unseen word from its character n-grams, and (b) words that are rare but share morphological structure with common words, such as “squeeze”, benefit from the sharing of sub-word information.
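A small sketch of the n-gram extraction and composition (the real implementation hashes n-grams into a fixed number of buckets with FNV; Python's built-in hash stands in here, and all names are my own):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    # Character n-grams with boundary markers, plus the full word itself.
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

def word_vector(word, ngram_table, buckets):
    # A word's vector is the sum of its n-gram vectors; this composed vector
    # is what enters the skip-gram + negative sampling loss.
    ids = [hash(g) % buckets for g in char_ngrams(word)]
    return ngram_table[ids].sum(axis=0)

# char_ngrams("squeezed", 3, 3)
# -> ['<sq', 'squ', 'que', 'uee', 'eez', 'eze', 'zed', 'ed>', '<squeezed>']
```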
  • Results: The fastText word embeddings are tested on five different languages of Wikipedia (at varying token counts) for word similarity and analogy tasks. For word similarity, Spearman’s rank correlation was used to compare human judgements with the cosine similarity of the word representations. In almost all cases, the fastText method performed better. For analogy tasks, the question is “A is to B as C is to D”, and the goal is to predict D given A, B and C. Once again, fastText performs decently well, near the state-of-the-art.
  • Remarks: The other paper mentioned speed as a huge factor, but this paper seems to state that for the unsupervised learning tasks, it could actually be slower by a factor of 1.5x compared to word2vec.
  • Note that paper focuses mainly on unsupervised training. This same model can also be used for supervised tasks including sentiment analysis or tag classification. A summary of that paper by the same authors can be found above.

Teaching Machines to Read and Comprehend (Hermann, Kocisky, Grefenstette, Espeholt, Kay, Suleyman, Blunsom) [Read Oct 26th]

  • Idea: Teaching machines to actually comprehend a story (rather than memorize corpus co-occurrence statistics) has been elusive up to this point, but RNN models with attention have been quite powerful. Even with the right model though, it’s hard to tell if the system is working because we don’t have robust datasets. Thus the authors contribute (a) a new dataset and (b) an evaluation using two types of attention models.
  • The dataset needs to solve a number of problems, which the authors address as such:
    1. Let’s first define the problem. If a machine can read and comprehend, it should be able to read a summary of some text, and then correctly answer a question about it. Thus the goal is to find meaningful document-query-answer triplets. This should be done in a scalable manner so that we get magnitudes more than the 100s of examples currently available.
    2. To get the documents, the authors pulled from CNN and Daily Mail articles, which, notably, come attached with separate summaries that “are abstractive and do not simply copy sentences from the documents.” Thus, Cloze-style queries can be created where a question is made by replacing a keyword with a blank. For example, suppose a summary of the document originally said “The coffee buyers shipped their supply from Argentina”. Then the question might become “The coffee buyers shipped their supply from __” and the right answer is supposed to be “Argentina”.
    3. Certain questions can be relatively easily guessed using n-grams or other co-occurrence methods, without actually understanding the article. For example, if asked “a) The hi-tech bra that helps you beat breast X; b) Could Saccharin help beat X?; c) Can fish oils help fight prostate X?”, a model could guess “cancer” simply because the phrase “breast cancer” is very common. To get around this issue, the entities are randomly masked, so words like “breast”, “fish oils”, or “saccharin” are changed to “ent924” or “ent317”.
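A toy version of that masking step (the entity list would come from the paper's coreference/NER pipeline; everything here is a simplification with names of my own):

```python
import re

def anonymize(document, question, entities):
    # Replace each named entity with an id like ent317 so a model cannot
    # answer from world knowledge alone; the paper also re-shuffles the ids
    # from example to example.
    mapping = {ent: f"ent{i}" for i, ent in enumerate(entities, 1)}
    for ent, tag in mapping.items():
        document = re.sub(re.escape(ent), tag, document)
        question = re.sub(re.escape(ent), tag, question)
    return document, question, mapping
```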
  • For the models, the authors used an LSTM, an Attentive Reader and an Impatient Reader. The LSTM is bidirectional and takes in a word token at each time step, then a delimiter, then the query tokens. The desired output is the missing word from the Cloze question. The LSTM also employs skip connections from each time step directly into the output. The Attentive Reader is very similar, except it also employs an attention mechanism that can look back at individual words (rather than at sentences). A context vector is built from the encoding after the final query token, and then used once to attend over the entire input. The Impatient Reader builds on this by allowing the context vector to be recurrently updated with each token in the query (rather than just once at the end of the query). The idea is that this allows the context vector to “accumulate information” from the document over time.
  • Results: The authors established a number of smart baselines to compare against. For example, to answer the question, choose the most common “ent###” that is found in the document but not in the question, which ends up getting the right answer 39/33% of the time on CNN/Daily Mail. Another idea is to measure word distance from the question to each possible entity in the document, then choose the closest matching entity. This achieved 51/56%.
  • The vanilla LSTM achieved 57/62%. The attentive reader was best in class on the Daily Mail dataset with 63/69%. The Impatient Reader receives highest marks on the CNN dataset with 64/68%. To me, this is basically a wash between the two attention methods. Additionally, I believe the overall results show that deep learning has made significant progress in actual reading comprehension, but at ~70% for the best models, there’s still a lot of room for improvement to reach human-level accuracy.

Attention-over-Attention Neural Networks for Reading Comprehension (Cui, Chen, Wei, Wang, Liu, Hu) [Read Oct 27th]

  • Idea: The authors introduce an “attended attention” model for answering Cloze-style questions (see Oct 26th for an explanation). Since the goal is D-Q-A processing (document, query, answer), there are two types of inputs. Concretely, the document tokens are fed into a bi-directional GRU. Then the query tokens are fed into their own, separate bi-directional GRU. Finally, some processing is done before predicting the single word answer.
  • Whereas a typical attention model might directly attend over tokens in the two GRU outputs, the AoA (attention-over-attention) model takes this one step further. First, a matrix M of pair-wise matching scores is created by taking the dot product of each word representation in the document with each word representation in the query. Next, two separate attentions are computed from M - one that captures document-to-query importance, and one that captures query-to-document importance. Finally, a dot product of these two attentions is calculated to create an “attended” document-level attention, which is used to generate the final answer.
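My reading of that mechanism in numpy (shapes and names are my own):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_attention(H_doc, H_query):
    # H_doc: (len_doc, dim) GRU states; H_query: (len_query, dim) GRU states.
    M = H_doc @ H_query.T           # pair-wise matching scores
    alpha = softmax(M, axis=0)      # document-level attention per query word
    beta = softmax(M, axis=1)       # query-level attention per document word
    query_attn = beta.mean(axis=0)  # average the query attention over the document
    return alpha @ query_attn       # "attended attention" over document positions
```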
  • Results: The AoA model gets 74.4% / 72.0% / 69.4% on the three datasets, which is better than other single models, but is certainly not the “large margin” that is touted. The best other models, including the GA Reader and EpiReader, achieve roughly 74% / 69% / 65% test accuracy on the same three datasets. Moreover, when the other models are ensembled, they are able to surpass the AoA - the authors did not attempt an ensemble.
  • Lessons: What is going on with this naming? Gradient descent by gradient descent?! More seriously, while the model does seem to be an interesting way to get more mileage out of the query when deciding which specific interactions are worth paying attention to, this seems like an incremental improvement rather than a fundamentally new idea.

Neural Semantic Encoders (Munkhdalai, Yu) [Read Oct 28th]

  • Idea: Present a new memory-augmented network that is more powerful than most, aimed directly at natural language understanding (NLU) as opposed to the algorithmic tasks performed by an NTM. The Neural Semantic Encoder (NSE) comes with 4 modules:
    • Read (LSTM): After transforming into a word embedding, the hidden state (o) is dot-producted with the current memory, and normalized to get an attention/context vector. This is used to select which memories to read.
    • Compose (MLP): Based on what is read, a new hidden state (h) is calculated with a multi-layer perceptron. This value will then be passed to the downstream task.
    • Write: this new hidden state (h) can also be written back into memory, overwriting what was previously there. Deciding whether or not to write is a tradeoff made by the attention mechanism above.
    • Memory Encoding: this is the storage unit actually holding the memories. The downstream task will obviously have access to this data bank as well when making predictions.
  • Results:
  • Unlike an NTM, the memory size of the NSE is not fixed beforehand and is adapted to a particular input. For example, longer sentences with more words will have longer memories.
  • Unlike MemN2N, the number of hops of NSE is not pre-determined.
  • Unlike DMN, the of NSE is
  • Unlike Neural Program-Interpreters (NPI), the NSE is
  • Remarks: With Read, Compose, Write and Memory modules, this almost looks like the input, candidate, output and forget gates respectively. This yields tools for an internal, long short-term memory, and similar tools for an external, long long-term memory.

Towards Neural Network-based Reasoning (Peng, Lu, Li, Wong) [Read Oct 29th]

  • Idea: MemN2N works really well on most bAbI tasks, but struggles with a couple of them. Around the same time, DMN also came out and had two tasks that were particularly troublesome, namely (17) Positional Reasoning and (19) Path Finding. The Neural Reasoner is able to boost results from 68.1% to 98% on PR and from 33.4% to 87% on PF, when compared to MemN2N. The major drawback is that this only worked on the 10k dataset. Notably, the model did not work as well on the 1k dataset relative to MemN2N, and in fact performed significantly worse, which might mean it was simply able to memorize the questions at 10k.
  • Similar to DMN (but unlike MemN2N), a Neural Reasoner (NR) uses a GRU for encoding questions and facts (rather than just a sum of words). Then the NR’s contribution is a “reasoner” that recursively performs updates on this encoding to prepare it for later. The reasoner seems to perform three operations:
    1. Join the question and fact at each time step through its own deep neural network. Compared to DMN and MemN2N, this is similar to taking the dot product of the question and the facts to generate the new candidate memory.
    2. Then the different outputs, which the authors oddly view as updated questions (as opposed to updated memories/facts) are pooled together. This is quite odd since this is where the external memory augmentation becomes useful, but instead a NR loses the distinctness of the facts by mixing them together. It’s almost the opposite idea of attention mechanism - rather than extract 1 fact out of 5, the NR mixes the 5 pieces of information together.
    3. Next, similar to MemN2N, it performs a pre-defined number of “hops” this way to continue refining the internal encoding.
  • After the iterations are complete, the network has an output encoding. This encoding is fed into a typical softmax to perform the final prediction task.
  • The type of deep neural network seems to be flexible, which can be seen as a strength or weakness of the system, depending on your point of view. Something that is pretty cool is that this portion can be augmented using auxiliary learners. In particular, during each “hop” of training, each “query encoding” (aka memory) can also be trained to either (1) reproduce the sentence itself, like an auto-encoder, or (2) reproduce a close variation of the sentence where the specific names of objects (“Peter”, “ball” in “Peter dropped the ball”) are replaced with generic variables “X” and “Y”. The paper presents two example auxiliary tasks, but theoretically other options are also possible. By adding these extra tasks, each encoding not only receives gradient updates from the final answer, but also from the auxiliary task (which is much closer, especially for the first few hops).
  • Results: As mentioned above, the NR performs better than DMN and MemN2N on key tasks, with certain caveats. The auxiliary modules are also critical to success, so it seems the core NR architecture is not that strong on its own, but the idea of auxiliary training tasks is worth pursuing.

Pointer Networks (Vinyals, Fortunato, Jaitly) [Read Oct 30th]

  • Idea: Seq2Seq and NTM models are extremely powerful, but in some ways limited in their expressive ability, namely when the size of the output dictionary varies with the input. The authors introduce a new neural architecture (Pointer Net = Ptr-Net) that learns to maximize the probability of an output sequence whose elements are discrete tokens corresponding to positions in the input sequence. These positions are selected by turning the softmax distribution that would normally form an attention context vector into a pointer, which remains trainable with gradient descent since it is still just a softmax.
  • In a normal attention-based seq2seq model, the size of the output dictionary is fixed beforehand, so for problems whose outputs are positions in the input, we would need to train a separate model for each input length n. This prevents us from learning solutions to problems that have an output dictionary whose size depends on the input sequence length. (What about using beam search or greedy search when generating variable length sequences? That varies the output length, but not as a function of the input.) Problems such as sorting variable-sized sequences, and various combinatorial optimization problems, belong to this class. The model handles variable-size output dictionaries using the recently proposed mechanism of neural attention: instead of using attention to blend the hidden units of an encoder into a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.
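The pointer itself is just ordinary additive attention whose softmax is read off as the output distribution; a sketch (names are my own):

```python
import numpy as np

def pointer_attention(enc_states, dec_state, W1, W2, v):
    # Scores over encoder states; the softmax is the output distribution,
    # i.e. a pointer back into the input rather than a fixed vocabulary.
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs                    # probs[i] = P(output = input position i)
```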
  • Results: Pointer Networks can find approximate solutions to three challenging geometric problems, namely planar convex hulls, computing Delaunay triangulations, and the planar Traveling Salesman Problem, using training examples alone rather than brute-force analysis. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also generalize to variable-size output dictionaries, since the models were able to generalize beyond the maximum lengths they were trained on. For example, the inputs had size n=50 during training, but during inference the inputs had size n=500.
  • Lesson: Check out http://cslt.riit.tsinghua.edu.cn/mediawiki/images/0/07/Ptrnets.pdf for more. Also note that we could solve the same problems with a plain seq2seq model that outputs “coordinates”, but that ignores the output constraints and would be less efficient. This ultimately feels like a slight tweak of the attention model for use on a different class of problems. It is hard to compare the idea to other models since the goal is so different - this one wouldn’t be able to solve LM tasks and the others can’t solve this one. What’s interesting is the flexibility, since different attention modules can still yield useful results.

Natural Language Comprehension with the EpiReader (Trischler, Ye, Yuan, Suleman) [Read Oct 31st]

  • Idea: EpiReader is an end-to-end model for reading comprehension that has two major components
    1. Extractor - proposes a small set of candidate answers after comparing the question to its supporting text, by running a pointer-network-style attention over the input passage with the context vector conditioned on the question. As a pointer network, single words are selected by a bi-GRU as potential candidates for the Cloze-style question (rather than a softmax over all words).
    2. Reasoner - formulates hypotheses using the proposed candidates and the question, then re-ranks the hypotheses based on their estimated concordance with the supporting text. Say the Extractor came up with 3 possible candidates; the Reasoner would plug each one into the question and “rate” them. Rating is performed as: (a) split the input paragraph into sentences, (b1) use a CNN to generate abstract representations of each sentence, (b2) use another CNN to get the hypothesis encoding, (b3) compute a similarity score “ς” between the hypothesis and each sentence as a quadratic form, ς = sᵀRh, where s is the sentence encoding, h is the hypothesis encoding and R is a trainable matrix, (c1) use an RNN to measure textual entailment, with the input being a concatenation of s, h, and ς, and (c2) with the same RNN, aggregate entailment over all sentences. The word that produced the highest rating is then returned as the response.
  • The name of the model comes from the philosopher Epicurus who first formalized the idea of Principle of Multiple Explanations: if several theories are consistent with the observed data, retain them all until more data is observed.
  • Results: Tested on the CNN/Daily Mail and Children’s Book Test machine comprehension benchmarks, EpiReader outperforms previous neural models by a significant margin - roughly 1.7% to 2.2% better in absolute terms than other models. As context, “human level accuracy” is sometimes around 75%, and EpiReader reaches 74%. Many other models are in the mid- to high 60s. It holds the leadership spot from June 2016 to at least Oct 2016.
  • Lesson: By leveraging the observation that the desired response is just a word, EpiReader is able to make a targeted attack on the problem, but now fails to generalize to other tasks. This is a bit disappointing because it feels like overfitting to the task at hand. In a normal chat dialog or recommendation task, the right “answer” won’t just be one word, and might not even exist in the input (but rather has to be retrieved from an external data source).

Dynamic Co-attention Networks For Question Answering (Xiong, Zhong, Socher) [Read Nov 16th]

  • Rajpurkar et al. released SQuAD (Stanford Question Answering Dataset) with 100,000+ questions for machine comprehension of text in June 2016, with a baseline of 51% accuracy. This dataset is notable for (a) having many more labels than most other QA datasets, which boosts quantity that is important for deep learning, and (b) having fairly realistic questions that are varied in type (who, when, where, how, why) and length (one-word answers, sentence-long answers), which represents a boost in quality over simplified Cloze-style questions.
  • Idea: The best way to tackle a QA dataset is to prime the system by letting the model see the question before reading the document, and to allow it to take multiple passes back over the document. In order to let the model know about the question, a co-attention mechanism is used on the encoder side that conditions the question on the document and the document on the question. Concretely, they “compute the summaries, or attention contexts, of the document in light of each word of the question” and vice versa. This takes the form of an “affinity matrix which is normalized row-wise to produce the attention weights AQ across the document for each word in the question, and column-wise to produce the attention weights AD across the question for each word in the document.” Some more matrix products and concatenations are performed to get to the final encoding. I should also mention that a “sentinel vector” is added to the document and question matrices to represent “nothing”. This is useful for the decoder, which needs to take this internal representation of the document/question and actually produce an answer.
  • The decoder part of the network is a pointer network that uses a bi-LSTM for processing, which is fed to a Highway Maxout Network (HMN). This is a combination of a Highway Network, which has skip connections, and a Maxout Network, which uses the max over several affine projections as its non-linearity. The pointers look back at the document and choose a starting point “s” where the answer begins and an ending point “e” where the answer finishes. Since it is a pointer network, these are discrete locations and not some smoothed probability.
  • Results: On the SQuAD dataset, the DCN achieves the state of the art results at 75.9% F1 with a single model and 80.4% F1 with an ensemble. The DCN significantly outperforms all other models with the closest F1 at 78.3% from Microsoft Research Asia. Allen Institute, Singapore Management University, and Google NYC also have competitive scores in the mid-70s, but not quite as good as DCN.
  • Remarks: SQuAD was supposed to be a substantially more difficult dataset, and it sure seems that way. Yet in less than 6 months we went from ~50% test accuracy to ~80% accuracy, while human-level performance is ~90%. Was this too easy, or is the pace of Deep Learning research just too crazy?!

Emoji2Vec: Learning Emoji Representations from their Description (Eisner, Rocktäschel, Augenstein, Bošnjak, Riedel) [Read Nov 2nd]

  • Idea: Emojis, such as 😂 and 😋, are included in about ~10% of Twitter posts and half of Instagram posts, yet are rarely ever discussed in NLP research. Barbieri et al., 2016 generated vector embeddings for emoji by training on a large Twitter dataset of over 100 million English tweets using the skip-gram method (Mikolov et al., 2013a). While this method is able to learn robust representations for frequently used emoji (about 700 of them), representations of the other 900+ emoji are estimated rather poorly or not available at all. Thus, the authors’ approach has two distinct differences:
    1. Train emoji representations directly on their descriptions (similar to the dictionary training of Hill et al., 2016) rather than on surrounding words in tweets, which allows coverage of the long tail of infrequently used emoji.
    2. Works with much less data, since only one English description per emoji is needed. This is an advantage in speed, but it also brings nontrivial disadvantages. Namely, English descriptions tend to limit the usefulness of the encodings when they are interpreted in the context of different languages or appropriated by other cultures. Additionally, using only static descriptions rather than live context may miss semantic meaning that is only temporally relevant.
  • Results: The authors projected the encodings into a 300-dim space that coincides with Google’s word2vec space. They found the embeddings to be useful in Twitter sentiment analysis task. The results can also be qualitatively evaluated by examining the vector space. Some notable examples include:
    • 👑 - 🚹 + 🚺 = 👸
    • 💵 - 🇺🇸 + 🇬🇧 = 💷
  • Remarks: This is interesting, but probably only useful in tasks related to Twitter and social media. Although that is limited now, I can definitely see user preferences (for recommendation systems) being constructed on texts or tweets they have sent in the past.
