
In-depth analysis of LSTM and Word Embedding, and improvements to RNN/Attention


This article was contributed by Zhong Hanting, a fan of Jizhi Club.

Zhong Hanting graduated from Huazhong University of Science and Technology and now works as an algorithm engineer at Ain Interactive Technology Development (Beijing) Co., Ltd., where he does NLP-related work and has built up solid knowledge and practical experience. Below are some paper reading guides he wrote.

LSTM: A Search Space Odyssey (arXiv: 1503.04069)

Author: Klaus Greff, Rupesh Kumar Srivastava, Jan Koutnik, Bas R. Steunebrink, Jurgen Schmidhuber


Since the Long Short-Term Memory (LSTM) architecture for recurrent neural networks was introduced in 1995, a number of LSTM variants have been proposed. In recent years these networks have become the state-of-the-art models for a variety of machine learning problems, which has renewed interest in the role of the various computational components of the typical LSTM and its variants. In this paper, we present a large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and music modeling. We optimized the hyperparameters of each of the eight variants separately using random search, and assessed the importance of these hyperparameters using the powerful fANOVA framework. In total we ran 5,400 experiments and summarize the insights gained from them; these experiments took approximately 15 years of CPU time, making this the largest study of its kind on LSTM networks. The results show that none of the eight variants significantly improves on the classical LSTM architecture on the three tasks. We also show that the forget gate and the output activation function are the most critical components of the LSTM. We further observe that the hyperparameters of the model are almost independent of one another, and we derive some guidelines for tuning them efficiently.

Brief comment

This paper compares eight LSTM variants in a large-scale experimental study. Six of the eight variants are obtained by removing a computational component from the classic LSTM, to test how important that component is. Of the remaining two, one is a GRU-like structure that couples the input gate and the forget gate, and the other is a full gate recurrence structure with recurrent connections between all the gates. The main conclusions of the experiments are: the forget gate and the output activation function are the most important components of the LSTM; the learning rate and the network size are the more important hyperparameters in LSTM training, while the momentum term has almost no effect; and the LSTM hyperparameters are almost independent of one another. If you do not need the details, you can go straight to the Conclusion section of the paper.
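
To make the variants concrete, here is a minimal sketch of one LSTM step for a single unit, with three of the modifications discussed above selectable by name. It is an illustration under simplifying assumptions, not the paper's implementation: input and state are scalars, peephole connections are left out, and the function names and toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w, variant="vanilla"):
    """One step of a single-unit LSTM with scalar input and state.

    w maps "i", "f", "o", "z" to (input weight, recurrent weight, bias).
    variant: "vanilla", "nfg" (no forget gate), "cifg" (coupled input and
    forget gate) or "noaf" (no output activation function).
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate
    z = math.tanh(pre("z"))    # candidate cell input
    o = sigmoid(pre("o"))      # output gate
    if variant == "nfg":
        f = 1.0                # forget gate removed: the cell never forgets
    elif variant == "cifg":
        f = 1.0 - i            # forget gate tied to the input gate
    else:
        f = sigmoid(pre("f"))  # ordinary forget gate
    c = f * c_prev + i * z
    # The output activation squashes the cell state before the output gate.
    h = o * (c if variant == "noaf" else math.tanh(c))
    return h, c

# Toy weights (hypothetical), identical for every gate.
weights = {name: (0.5, 0.5, 0.0) for name in ("i", "f", "o", "z")}
for variant in ("vanilla", "nfg", "cifg", "noaf"):
    h, c = lstm_step(1.0, 0.0, 2.0, weights, variant)
    print(f"{variant:8s} h={h:.4f} c={c:.4f}")
```

Without the forget gate the old cell state is always carried over in full, and without the output activation the cell state reaches the output unsquashed, which gives some intuition for why these two components matter most.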

Understanding Neural Networks Through Representation Erasure (arXiv: 1612.08220)

Author: Jiwei Li, Will Monroe, Dan Jurafsky


Although neural networks have been applied successfully to many natural language processing tasks, they remain hard to interpret. In this paper, we present a general approach to analyzing and interpreting the decisions of a neural model by erasing parts of its representation, such as some dimensions of the input word vectors, some neurons in the hidden layers, or some of the input words. We propose several ways to analyze the effect of such erasure, such as comparing the model's evaluation results before and after erasure, and using reinforcement learning to select the minimum set of input words that must be deleted for the classification result of a neural classifier to change. In a comprehensive analysis of multiple NLP tasks (including linguistic feature classification, sentence-level sentiment analysis, and document-level sentiment aspect prediction), we find that the proposed method not only provides clear explanations of neural model decisions but can also be used for error analysis.

Brief comment

An interesting article. It reveals some clear differences between the word vectors produced by Word2Vec and GloVe, and shows that word frequency in the training corpus has a large influence on the resulting word representations. At the sentence level, the sentiment analysis experiments show that sentiment words have a significant impact on the classification result, and the method can also find the words that cause the model to misclassify, which is interesting. At the document level, the aspect prediction experiments clearly reveal which parts of a document are strongly related to a specific aspect. These experiments also show that the bidirectional LSTM has stronger expressive power than the classic LSTM, and that the classic RNN is the weakest.
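
The word-level erasure idea can be sketched in a few lines: score the input, erase one word, score again, and take the relative drop as that word's importance. The "model" below is a hypothetical lexicon-based sentiment scorer standing in for a trained network, and the lexicon values are made up for illustration.

```python
import math

# Hypothetical toy lexicon; a real experiment would use a trained model.
LEXICON = {"great": 2.0, "good": 1.0, "boring": -1.5, "not": -0.5}

def positive_prob(words):
    """Toy sentiment model: sigmoid of the summed lexicon scores."""
    score = sum(LEXICON.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))

def erasure_importance(words, model=positive_prob):
    """Relative drop in the model's score when each word is erased in turn."""
    base = model(words)
    result = []
    for k, word in enumerate(words):
        erased = words[:k] + words[k + 1:]
        result.append((word, (base - model(erased)) / base))
    return result

for word, importance in erasure_importance(["the", "movie", "was", "great"]):
    print(f"{word:6s} {importance:+.3f}")
```

Words the model ignores get an importance of zero, while the sentiment word dominates, which mirrors the sentence-level finding described above.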

Interactive Attention for Neural Machine Translation (arXiv: 1610.05011)

Author: Fandong Meng, Zhengdong Lu, Hang Li, Qun Liu


Conventional attention-based Neural Machine Translation (NMT) performs dynamic alignment while generating the target sentence. By repeatedly reading the representation of the source sentence produced by the encoder, the attention mechanism greatly improves NMT quality. In this paper, we propose a new attention mechanism called "Interactive Attention". During translation, the decoder not only reads the representation of the source sentence but also modifies it. Interactive Attention thereby keeps track of the interaction between the decoder and the source sentence representation, which improves translation performance. Experiments on the NIST Chinese-English translation task show that our Interactive Attention model improves significantly over both the original attention-based NMT model and other improved models (such as the Coverage Model). Across multiple test sets, the NMT system with Interactive Attention scores on average 4.22 BLEU points higher than the open-source attention-based system and 3.94 BLEU points higher than the open-source statistical machine translation system Moses.

Brief comment

Both the Coverage Model mentioned in this paper and the paper itself aim to further improve existing NMT methods and reduce over-translation and under-translation. The method proposed here is an improvement to the attention mechanism, and the idea is very simple: the representation of the source sentence generated by the encoder, that is, the encoder's hidden state sequence, is treated as a readable and writable memory. Decoding not only performs a weighted read at each step but also performs a weighted modification of the memory after each step. The idea can be traced back to the Neural Turing Machine of 2014, and it is also very similar in form to the Dynamic Memory Network of 2015.
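
The read-then-write idea can be sketched as follows. The memory is a list of vectors standing in for the encoder hidden states. The write step uses a Neural Turing Machine style erase-and-add update, which is an assumption made here for illustration; the exact update rule of the paper is not given above, and all function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(memory, query):
    """Dot-product attention of a decoder query over the memory cells."""
    return softmax([sum(a * b for a, b in zip(cell, query)) for cell in memory])

def weighted_read(memory, weights):
    """Attention-weighted sum of the memory cells (the usual context vector)."""
    dim = len(memory[0])
    return [sum(w * cell[d] for w, cell in zip(weights, memory)) for d in range(dim)]

def weighted_write(memory, weights, erase, add):
    """Return a new memory: each cell is erased and updated in proportion
    to its attention weight (NTM-style update, assumed for this sketch)."""
    return [[v * (1.0 - w * e) + w * a for v, e, a in zip(cell, erase, add)]
            for w, cell in zip(weights, memory)]

memory = [[1.0, 0.0], [0.0, 1.0]]          # toy "encoder states"
w = attention_weights(memory, [10.0, 0.0])  # query focused on the first cell
context = weighted_read(memory, w)
memory_after = weighted_write(memory, w, erase=[1.0, 1.0], add=[0.0, 0.0])
print(context)
print(memory_after)
```

After the write, the cell that was just attended to is almost emptied while the other cell is nearly untouched, so a later decoding step is less likely to translate the same source position again.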

Coherent Dialogue with Attention-based Language Models (arXiv: 1611.06997)

Author: Hongyuan Mei, Mohit Bansal, Matthew R. Walter


We model the continuation of coherent conversations with an RNN dialogue model equipped with a dynamic attention mechanism. Our attention-RNN language model dynamically widens the scope of its attention over the conversation history as the dialogue continues, whereas the standard attention model in a seq2seq model has a fixed scope. This allows each generated word to be associated with the relevant words in the conversation history. We evaluate our model on two popular dialogue datasets, the open-domain MovieTriples dataset and the closed-domain Ubuntu dialogue dataset, and obtain sizable improvements over the baselines and the current state-of-the-art models on several metrics, including diversity metrics and human evaluation. Our work also shows that a simple RNN with dynamic attention, by making use of flexible long-distance memory, can outperform more complex memory models such as LSTM and GRU. Finally, we further improve the coherence of the conversation through reranking based on a topic model.

Brief comment

The attention mechanisms we usually see are used in encoder-decoder models. Both the idea in this paper of applying attention directly to the RNN and the view that "dialogue is closer to language modeling than to machine translation" are quite interesting.
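
A minimal sketch of attention with a growing scope: a scalar toy RNN that, at every step, attends over all of its own earlier hidden states. The scoring function (a plain product) and the weights are assumptions for illustration, and the sketch only computes the attention context; how the paper feeds that context into word prediction is not covered here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_with_dynamic_attention(inputs, w_x=0.5, w_h=0.5):
    """Scalar toy RNN that attends over all of its earlier hidden states.

    Returns a list of (scope, context) pairs, one per step, where scope is
    the number of earlier states attended over at that step.
    """
    history, h, trace = [], 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        if history:
            weights = softmax([h * past for past in history])
            context = sum(a * past for a, past in zip(weights, history))
        else:
            context = 0.0      # nothing to attend to at the first step
        trace.append((len(history), context))
        history.append(h)      # the attention scope grows by one each step
    return trace

for scope, context in rnn_with_dynamic_attention([1.0, -1.0, 0.5, 2.0]):
    print(scope, round(context, 4))
```

In contrast to seq2seq attention, where the set of attended states is fixed once the encoder has run, the scope here is one state larger at every step.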

Attention-based Memory Selection Recurrent Network for Language Modeling (arXiv: 1611.08656)

Author: Da-Rong Liu, Shun-Po Chuang, Hung-yi Lee


Recurrent Neural Networks (RNNs) have achieved great success in language modeling. However, because an RNN uses a fixed-size memory, it cannot store information about every word it has processed in a sentence, so useful long-term information is lost when predicting the next word. In this paper, we propose the attention-based memory selection recurrent network (AMSRN), which can review the information stored in memory at previous time steps and select the relevant information to help generate the output. In AMSRN, the attention mechanism first locates the memory that stores the relevant information and then extracts the information from it. In our experiments, AMSRN outperforms the LSTM language model on both English and Chinese corpora. In addition, we investigate using entropy to regularize the attention weights, and we use the attention weights to visualize how the attention mechanism contributes to the language model.
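
For intuition about the entropy term mentioned at the end of the abstract, the sketch below computes the Shannon entropy of an attention distribution. It only shows the quantity itself; how the paper weights this term and with which sign it enters the training loss is not stated above, so that part is deliberately left out.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over 4 steps
peaked = [0.97, 0.01, 0.01, 0.01]   # attention focused on one step
print(attention_entropy(uniform))
print(attention_entropy(peaked))
```

A uniform distribution has the maximum entropy (log of the number of time steps), while a sharply focused one has entropy close to zero, so the entropy is a direct measure of how selective the attention is.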

Brief comment

Similar to Coherent Dialogue with Attention-based Language Models, this paper adds attention directly to an RNN language model. The model description is very clear, but the paper is light on content.

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (arXiv: 1612.07837)

Author: Soroush Mehri, Kundan Kumar, et al.


In this paper we present a new model for unconditional audio generation that generates one audio sample at a time. Our model combines autoregressive multi-layer perceptrons with RNNs in a hierarchical structure. Experiments on three different datasets show that such models can capture the underlying variation in time series over very long time spans. Human evaluation shows that our model works better than other audio generation models. In addition, we show how each part of the model contributes to its overall performance.

Brief comment

The hierarchical structure described in this paper is roughly as follows: every layer is an RNN, but the frame length of the audio input to each layer decreases from the first layer to the last, and the hidden state output by one layer is used as an extra input to the next RNN layer. This is equivalent to enhancing the long-term memory of the RNN, so the final model's results when modeling audio sample sequences of length 512 and of length 32 differ only slightly. This approach of modeling the input at different granularities in different layers that reinforce one another has something in common with some recent hierarchical attention mechanisms in question answering. Even if you have no interest in audio generation, it is worth a quick read.
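
The multi-timescale conditioning can be sketched with two scalar toy RNN tiers: a slow tier that reads long frames and a fast tier that reads short frames and receives the slow tier's hidden state as an extra input. This is only a structural illustration under strong simplifications: the frame sizes, weights, and the use of the frame mean as input feature are all assumptions, and the sample-level output module and the autoregressive shift between tiers are omitted.

```python
import math

def run_tier(samples, frame_size, extra=None, w_in=0.5, w_h=0.5, w_extra=0.5):
    """Run a scalar toy RNN over non-overlapping frames of the samples.

    extra: optional list holding one conditioning value per sample,
    produced by a slower tier. Returns one hidden state per frame.
    """
    h, states = 0.0, []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        x = sum(frame) / frame_size                 # toy frame feature: the mean
        cond = extra[start] if extra is not None else 0.0
        h = math.tanh(w_in * x + w_h * h + w_extra * cond)
        states.append(h)
    return states

def upsample(states, factor):
    """Repeat each slow-tier state so there is one value per sample."""
    return [s for s in states for _ in range(factor)]

def two_tier(samples, slow_size=4, fast_size=2):
    # Keep only whole slow frames so both tiers cover the same samples.
    usable = len(samples) - len(samples) % slow_size
    samples = samples[:usable]
    slow = run_tier(samples, slow_size)
    fast = run_tier(samples, fast_size, extra=upsample(slow, slow_size))
    return slow, fast

slow, fast = two_tier([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
print(len(slow), len(fast))
```

The slow tier updates once every four samples and the fast tier once every two, so the fast tier sees a summary of a longer context than its own frame length would allow.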

Finally, I recommend a series of courses on deep learning and natural language processing taught by Zhong Hanting together with Li Yuran and Yan Huiling. The first class is free.