
How to use deep learning to automatically generate text descriptions for photos?


Image description involves generating a human-readable textual description for a given image, such as a photo. The task is very easy for humans but very hard for machines, because it requires both understanding the content of the image and translating that understanding into natural language.

Recently, deep learning methods have replaced classical methods for automatically generating image descriptions (called "captions") and now achieve state-of-the-art results. In this article, you'll learn how deep neural network models can be used to automatically generate descriptions for images such as photos.

After reading this article, you will understand:

  • Why generating text descriptions for images is difficult, and why it requires combining breakthroughs in computer vision and natural language processing.
  • The components of the neural description model (i.e., the feature extractor and the language model).
  • How to combine these elements into an encoder-decoder model, possibly with an attention mechanism.


This article is divided into three parts, namely:

1. Use text to describe the image

2. Neural description model

3. Encoder-decoder architecture

Use text to describe the image

Describing an image means generating a human-readable textual description of it, such as a photo of an object or scene.

This problem is sometimes referred to as "automatic image annotation" or "image tagging."

It is easy for humans, but very difficult for machines.

A quick glance at an image is enough for a human to point out and describe a wealth of detail about the visual scene. Yet this remarkable ability has proven hard for our visual recognition models to master.
- "Deep Visual-Semantic Alignments for Generating Image Descriptions", 2015

To solve this problem, a system must understand the content of the image and also express that meaning in words, and the words must be strung together correctly to be understood. This requires combining computer vision with natural language processing, and it is a hard problem for artificial intelligence in the broad sense.

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.
- "Show and Tell: A Neural Image Caption Generator", 2015

The problem also comes in different levels of difficulty; let us look at three variants of it.

1. Classify the image

Assign a category label to an image from hundreds or thousands of known categories.

2. Describe the image

Generate a textual description of the image content.

3. Annotate the image

Generate a text description for specific regions in the image.

This problem can also be extended to describe the images in the video over time.

In this article, our focus is on describing images, which we call image captioning.

Neural description model

The neural network model has dominated the field of automatic description generation; this is mainly because this method yields the best results currently available.

Prior to the end-to-end neural network model, the two main methods of generating image descriptions were template-based methods and methods based on nearest neighbors and modifying existing descriptions.

Before neural networks were used to generate descriptions, two approaches predominated. The first generated description templates that were filled in based on the results of object detection and attribute discovery. The second first retrieved similar, already described images from a large database and then modified the retrieved descriptions to fit the query image. ... Both approaches have since fallen out of favor as neural network methods became dominant.
- "Show and Tell: A Neural Image Caption Generator", 2015

The neural network model used for the description involves two main elements:

1. Feature extraction model

2. Language model

Feature extraction model

The feature extraction model is a neural network. Given an image, it extracts the salient features, usually represented as a fixed-length vector.

The extracted features are an internal representation of the image, not something that humans can interpret directly.
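To make the "fixed-length vector" idea concrete, here is a minimal sketch in plain Python. It is only a toy stand-in for a real feature extractor: the two 2x2 filters, their values, and the function names are made up for illustration and are not trained weights. Each filter is slid over a grayscale image, passed through a ReLU, and then averaged over the whole feature map, so an image of any size maps to a vector with one entry per filter.

```python
# Toy feature extractor: convolution + ReLU + global average pooling.
# The filters are hypothetical, hand-picked values, not trained weights.

FILTERS = [
    [[1.0, -1.0], [1.0, -1.0]],   # responds to vertical edges
    [[1.0, 1.0], [-1.0, -1.0]],   # responds to horizontal edges
]


def convolve_valid(image, kernel):
    """Slide a kernel over a 2D image (no padding) and apply ReLU."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - k_rows + 1):
        row = []
        for c in range(len(image[0]) - k_cols + 1):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(max(total, 0.0))  # ReLU keeps only positive responses
        out.append(row)
    return out


def extract_features(image):
    """Return one averaged activation per filter: a fixed-length vector."""
    features = []
    for kernel in FILTERS:
        fmap = convolve_valid(image, kernel)
        values = [v for row in fmap for v in row]
        features.append(sum(values) / len(values))
    return features


small = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]      # 3x3 image
large = [[0, 0, 1, 1, 0] for _ in range(6)]    # 6x5 image
print(len(extract_features(small)), len(extract_features(large)))  # prints: 2 2
```

Both images produce a vector of length 2 even though their sizes differ, which is the property the language model relies on. The individual numbers in the vector mean little to a human reader.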

A deep convolutional neural network (CNN) is usually used as the feature extraction submodel. Such a network can be trained directly on the images in the image description dataset.

Alternatively, a pre-trained model (such as a state-of-the-art image classification model) can be used, or a hybrid approach in which a pre-trained model is fine-tuned on the problem at hand.

It is common practice to use the best-performing models developed on the ImageNet dataset for the ILSVRC challenge, such as the Oxford Visual Geometry Group model, or VGG for short.

... We explored a variety of techniques to deal with overfitting. The most obvious way to avoid overfitting is to initialize the weights of the CNN component of our system from a pre-trained model.
- "Show and Tell: A Neural Image Caption Generator", 2015

Language model

In general, a language model predicts the probability of the next word in a sequence, given the words already present in the sequence.
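As a rough illustration of "predicting the probability of the next word", the sketch below builds a bigram model from counts. This is a deliberately simplified assumption: a bigram model looks only at the single previous word, whereas the neural language models discussed in this article condition on the whole sequence so far. The three-caption corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model would learn from far more captions.
CAPTIONS = [
    "a dog runs on the grass",
    "a dog sits on the sofa",
    "a cat sits on the sofa",
]


def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<start>"] + sentence.split() + ["<end>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev_word):
    """Turn the counts for one context word into a probability distribution."""
    followers = counts[prev_word]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}


model = train_bigram_model(CAPTIONS)
print(next_word_probs(model, "a"))    # "dog" is twice as likely as "cat"
print(next_word_probs(model, "the"))  # "sofa" is twice as likely as "grass"
```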

For image description, the language model is a neural network that, given the features extracted by the feature extraction network, predicts the sequence of words in the description, building it up conditioned on the words that have already been generated.

A common choice is to use a recurrent neural network as the language model, such as a Long Short-Term Memory (LSTM) network. Each output time step generates a new word in the sequence.

Each generated word is then encoded using a word embedding (such as word2vec) that is passed as input to the decoder to generate subsequent words.
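The word-by-word loop can be sketched as follows. The lookup table is a hypothetical stand-in for a trained decoder: in a real captioning model the next-word distribution would come from an LSTM that sees the image features and the embeddings of the words generated so far, not from a fixed dictionary keyed on the previous word. The sketch only shows the control flow, where each generated word is fed back in as the next input until an end token appears.

```python
# Hypothetical next-word distributions standing in for a trained decoder.
NEXT_WORD = {
    "<start>": {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.6, "cat": 0.4},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"runs": 0.7, "<end>": 0.3},
    "cat": {"sleeps": 0.8, "<end>": 0.2},
    "runs": {"<end>": 1.0},
    "sleeps": {"<end>": 1.0},
}


def greedy_caption(next_word, max_len=10):
    """Generate one word per time step, feeding each word back as input."""
    words = []
    current = "<start>"
    for _ in range(max_len):
        probs = next_word[current]
        current = max(probs, key=probs.get)  # pick the most likely next word
        if current == "<end>":
            break
        words.append(current)
    return " ".join(words)


print(greedy_caption(NEXT_WORD))  # prints: a dog runs
```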

An improvement on this approach is to collect the probability distribution over the vocabulary at each step of the output sequence and search through it to generate multiple candidate descriptions. These candidates can be scored and ranked by likelihood. Beam search is commonly used for this search.
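A minimal beam search sketch is shown below, under the same simplifying assumption as a lookup-table decoder: the hypothetical NEXT_WORD table stands in for the distributions a trained model would produce at each step. The search keeps the beam_width best partial captions at every step, scores them by summed log-probability, and returns the finished captions ranked by likelihood.

```python
import math

# Hypothetical next-word distributions standing in for a trained decoder.
NEXT_WORD = {
    "<start>": {"a": 0.6, "two": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "two": {"dogs": 0.9, "cats": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
    "dogs": {"<end>": 1.0},
    "cats": {"<end>": 1.0},
}


def beam_search(next_word, beam_width=3, max_len=10):
    """Return finished captions as (caption, log_probability), best first."""
    beams = [(["<start>"], 0.0)]  # each beam: (words so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for word, prob in next_word[words[-1]].items():
                candidates.append((words + [word], score + math.log(prob)))
        # Keep only the highest-scoring partial captions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_width]:
            if words[-1] == "<end>":
                finished.append((" ".join(words[1:-1]), score))
            else:
                beams.append((words, score))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished


for caption, score in beam_search(NEXT_WORD):
    print(caption, round(math.exp(score), 2))
```

With this table, always taking the single most likely word would start with "a" and end with a caption of probability 0.30, while the beam also keeps "two" alive and finds "two dogs" with probability 0.36.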

The language model can be trained on its own using pre-computed features extracted from the image dataset; it can also be trained jointly with the feature extraction network, or with some combination of the two.

Encoder-decoder architecture

A popular way to structure the submodels is an encoder-decoder architecture, in which the two models are trained jointly.

The basis of this model is a convolutional neural network that encodes the image into a compact representation, followed by a recurrent neural network that generates the corresponding sentence. The model is trained to maximize the likelihood of the sentence given the image.
- "Show and Tell: A Neural Image Caption Generator", 2015
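Written out, the training objective described in the quote looks roughly like this. The notation loosely follows the Show and Tell paper and is introduced here only for illustration: I is an image, S is its reference sentence with words S_0, ..., S_N, and θ denotes the model parameters.

```latex
\theta^{\star} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
```

The second equation is the chain rule applied to the sentence: the likelihood of the whole caption is the product of the per-word probabilities, each conditioned on the image and on the words before it.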

This architecture was originally developed for machine translation, where the input sequence (say, in French) is encoded into a fixed-length vector by an encoder network. A separate decoder network then reads the encoding and generates an output sequence in another language, such as English.

Beyond its strong performance, the advantage of this approach is that a single end-to-end model can be trained on the problem.

When this method is used for image description, the encoder network uses a deep convolutional neural network and the decoder network is a stack of LSTM layers.

In machine translation, the "encoder" RNN reads the source sentence and converts it into an informative, fixed-length vector representation, which in turn is used as the initial hidden state of the "decoder" RNN that generates the target sentence. Here we propose to follow this elegant approach and replace the encoder RNN with a deep convolutional neural network (CNN).
- "Show and Tell: A Neural Image Caption Generator", 2015
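The wiring described in the quote can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the encoder is a trivial stand-in rather than a CNN, the decoder is a plain RNN rather than a stack of LSTM layers, and the weights are random and untrained, so the generated words are arbitrary. The only point is the data flow, in which the encoder's fixed-length vector becomes the decoder's initial hidden state.

```python
import math
import random

VOCAB = ["<start>", "<end>", "a", "dog", "cat", "runs"]
HIDDEN = 4  # size of the image vector and of the decoder hidden state

random.seed(0)  # untrained, randomly initialised weights (illustration only)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W_hh = rand_matrix(HIDDEN, HIDDEN)      # hidden state -> hidden state
W_xh = rand_matrix(HIDDEN, len(VOCAB))  # one-hot input word -> hidden state
W_hy = rand_matrix(len(VOCAB), HIDDEN)  # hidden state -> vocabulary scores


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def encode(image_pixels):
    """Stand-in encoder: squash a flat pixel list into a HIDDEN-sized vector."""
    chunk = max(1, len(image_pixels) // HIDDEN)
    return [math.tanh(sum(image_pixels[i * chunk:(i + 1) * chunk]))
            for i in range(HIDDEN)]


def decode(image_vector, max_len=5):
    """Plain RNN decoder whose initial hidden state is the image vector."""
    hidden = image_vector
    word = "<start>"
    caption = []
    for _ in range(max_len):
        one_hot = [1.0 if w == word else 0.0 for w in VOCAB]
        pre = [a + b for a, b in zip(matvec(W_hh, hidden), matvec(W_xh, one_hot))]
        hidden = [math.tanh(p) for p in pre]
        scores = matvec(W_hy, hidden)
        word = VOCAB[scores.index(max(scores))]  # greedy choice
        if word == "<end>":
            break
        caption.append(word)
    return caption


caption = decode(encode([0.1, 0.5, 0.9, 0.3, 0.7, 0.2, 0.4, 0.8]))
print(caption)  # arbitrary words, because the weights are untrained
```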

Description model using the attention mechanism

One limitation of the encoder-decoder architecture is that it uses a single fixed-length representation to hold the extracted features.

In machine translation, this limitation was addressed by an attention mechanism applied over a richer encoding, which lets the decoder learn where to focus as it generates each word of the translation.

This approach has also been used to improve the performance level of the encoder-decoder architecture for image descriptions - allowing the decoder to learn which parts of the image should be focused on when generating each word in the description.
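One step of soft attention can be sketched as below. The scoring function is an assumption: this sketch uses a simple dot product between each region's feature vector and the decoder state, whereas published captioning models typically learn a small network for scoring. The region features and the decoder state are hypothetical numbers. The sketch shows the two essential operations, a softmax that turns scores into weights over image regions and a weighted average that produces the context vector for the current word.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, decoder_state):
    """Return (weights, context) for one decoding step."""
    # Score each image region by its dot product with the decoder state.
    scores = [sum(f * h for f, h in zip(region, decoder_state))
              for region in region_features]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the region features.
    size = len(region_features[0])
    context = [sum(w * region[i] for w, region in zip(weights, region_features))
               for i in range(size)]
    return weights, context


# Hypothetical features for three image regions and one decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The weights are what gets visualized in attention-based captioning: one weight per image region, recomputed for every generated word.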

Encouraged by recent advances in caption generation and inspired by the successful application of attention mechanisms in machine translation and object recognition, we investigated models that can attend to the salient parts of an image while generating its description.
- "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", 2015

A big advantage of this approach is that you can visualize exactly where the model is attending as it generates each word in the description.

We also show through visualization how the model automatically learns to fix its gaze on salient objects as it generates the corresponding words in the output sequence.
- "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", 2015


Further reading

If you want to learn more about image descriptions, you can refer to the resources given here.




Compiled by: Heart of the Machine
