Among deep learning text classification models, HAN (Hierarchical Attention Network) is a particularly interesting one worth studying: it addresses TextCNN's loss of text structure information and achieves good classification accuracy on long text, and, more importantly for a modern model, it is highly interpretable.
Let's take a look at its structure, shown below:
Take the second sentence in the figure as an example. The input word-vector sequence is fed through a word-level Bi-GRU, so each word gets a corresponding Bi-GRU hidden vector. Each hidden vector is then dotted with a learned context vector to obtain an attention weight, and the sequence is summed, weighted by those attention weights, to produce a sentence summary vector. The sentence vectors are then fed into a second Bi-GRU of the same structure to obtain the final document feature vector v, which is passed through a dense layer and a classifier to produce the final text classification result. This model structure matches the human reading process from word → sentence → document very well.
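The word-level attention pooling step can be sketched in plain NumPy. This is an illustrative sketch, not the paper's code; the function name `attention_pool` and the toy shapes are my own:

```python
import numpy as np

def attention_pool(h, w, b, u):
    """Attention-weighted sum over a sequence of Bi-GRU hidden states.

    h: (T, d) hidden vectors from the Bi-GRU, one per word
    w: (d, d) projection matrix, b: (d,) bias
    u: (d,) learned context vector
    """
    # hidden representation of each time step: u_t = tanh(W h_t + b)
    u_t = np.tanh(h @ w + b)                  # (T, d)
    # attention weights: softmax over dot products with the context vector
    scores = u_t @ u                          # (T,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # sentence summary vector: weighted sum of the hidden states
    return alpha @ h, alpha                   # (d,), (T,)

# toy example: a 5-word sentence with hidden size 4
rng = np.random.default_rng(0)
h = rng.standard_normal((5, 4))
w = rng.standard_normal((4, 4))
b = np.zeros(4)
u = rng.standard_normal(4)
s, alpha = attention_pool(h, w, b, u)
```

The same pooling is applied again at the sentence level, with sentence summary vectors in place of word hidden vectors, to produce the document vector v.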
During tuning, we found that the most essential part of the HAN model is the attention. Why? Because when we tried adding an L2 penalty at various places in the network, the two attention layers were where the model's expressive power was hurt the most. This is probably also why the attention visualizations work so well.
The figure below shows the visualization from the original paper. The red block at the left of each line indicates the sentence's attention weight: the darker the color, the larger the weight. Blue indicates the weight of each word within its sentence. Note that, so that relatively important words inside unimportant sentences still show up, the color depth of a word here = sqrt(sentence weight) * word weight. The visual effect in the paper is quite good. In the restaurant-review example in the upper left, HAN picks up the contrast: "i don't even like" is not an actual evaluation of the restaurant, and this is in fact a 4-star review.
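The dyeing rule can be written out directly. A minimal sketch; the weights below are made-up illustrative numbers, and `word_color_depth` is my own name for the helper:

```python
import math

def word_color_depth(sentence_weight, word_weight):
    # sqrt dampens the sentence weight so that an important word
    # inside an unimportant sentence still gets a visible color
    return math.sqrt(sentence_weight) * word_weight

# a weak sentence (weight 0.04) with a strong word (weight 0.9):
# with sqrt the depth is about 0.18; without it, only 0.036
depth = word_color_depth(0.04, 0.9)
```

Without the square root, a word in a sentence with weight 0.04 would be almost invisible no matter how large its own weight is.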
Now let's look at the visualization from our own test on a charge-prediction task in the Chinese legal domain. The model here is word-based. The visualizations below are not a handful of cherry-picked successes; in most cases the model's visualization can explain its output. The dyeing procedure follows the paper: the color depth of a word = sqrt(sentence weight) * word weight, and the color depth of a sentence is simply the sentence's attention weight.
As shown in the following figure, on this very long text HAN judges the sentences in the middle to be essentially irrelevant, nowhere near as useful as the sentence beginning "the public prosecutor holds that".
As shown in the figure below, although the model sees the word "stealing" in the middle of the second line of text, it still concludes that the main event in this case is robbery. This is the advantage of preserving text structure.
This kind of visualization is also very helpful for error analysis. For example, the case below is actually intentional property damage, but the model predicted arson. Looking at the text, it is easy to see that the sentence describing the intentional damage was cut off, so the model could not find the key words for intentional damage and instead highlighted the first sentence.
Now change the sentence and look again:
As you can see, deep learning models are not incomprehensible black boxes, and this kind of interpretability helps a great deal during tuning.
Here are a few more examples; I won't explain them one by one.
Yang, Zichao, et al. "Hierarchical Attention Networks for Document Classification." Proceedings of NAACL-HLT. 2016.