Cross Modification Attention Based Deliberation Model for Image Captioning

Zheng Lian, Yanan Zhang, Haichang Li, Rui Wang, Xiaohui Hu

TL;DR

A novel Cross Modification Attention-based Deliberation Model (CMA-DM) is proposed. It exploits the complementarity of images and their corresponding coarse captions to supply more reliable features for refinement, and it employs a potential-oriented reward shaping strategy for reinforcement learning to improve the quality of refinement in a targeted way.

Abstract

The conventional encoder-decoder framework for image captioning generally adopts a single-pass decoding process, which predicts the target descriptive sentence word by word in temporal order. Despite its great success, this framework suffers from two serious disadvantages. First, it cannot correct mistakes in already predicted words, which may mislead subsequent predictions and cause errors to accumulate. Second, it can only leverage the words generated so far, not possible future words, and thus lacks the ability to plan globally over linguistic information. To overcome these limitations, we explore a universal two-pass decoding framework, in which a single-pass decoding model serving as the Drafting Model first generates a draft caption from an input image, and a Deliberation Model then polishes the draft into a better image description. Furthermore, inspired by the complementarity between different modalities, we propose a novel Cross Modification Attention (CMA) module that enhances the semantic expression of the image features and filters out erroneous information from the draft captions. We integrate CMA into the decoder of our Deliberation Model and name the result the Cross Modification Attention based Deliberation Model (CMA-DM). We train the proposed framework by jointly optimizing all trainable components from scratch with a trade-off coefficient. Experiments on the MS COCO dataset demonstrate that our approach obtains significant improvements over single-pass decoding baselines and achieves competitive performance compared with other state-of-the-art two-pass decoding methods.
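The two-pass control flow described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `two_pass_caption`, `joint_loss`, and the toy models are stand-ins for illustration, not the authors' code. In particular, the weighted-sum form of `joint_loss` is only one plausible reading of "a trade-off coefficient"; this section does not state the exact objective.

```python
from typing import Callable, List, Tuple

Caption = List[str]
Features = List[float]


def two_pass_caption(
    image: Features,
    drafting_model: Callable[[Features], Caption],
    deliberation_model: Callable[[Features, Caption], Caption],
) -> Tuple[Caption, Caption]:
    """First pass produces a draft; second pass refines it.

    The second pass receives the whole draft, so it can use words that lie
    "in the future" relative to any position it is rewriting.
    """
    draft = drafting_model(image)
    refined = deliberation_model(image, draft)
    return draft, refined


def joint_loss(draft_loss: float, deliberation_loss: float, trade_off: float) -> float:
    """ASSUMPTION: a weighted sum of the two models' losses.

    The exact objective is not given in this section; this is one common form.
    """
    return draft_loss + trade_off * deliberation_loss


# Toy stand-ins (hypothetical): real models would be neural networks.
def toy_drafter(image: Features) -> Caption:
    return ["two", "dog", "on", "the", "grass"]


def toy_deliberator(image: Features, draft: Caption) -> Caption:
    # Pretend image[0] encodes how many objects are visible.
    object_count = round(image[0])
    return ["a" if (word == "two" and object_count == 1) else word for word in draft]


if __name__ == "__main__":
    draft, refined = two_pass_caption([1.0], toy_drafter, toy_deliberator)
    print(" ".join(draft))
    print(" ".join(refined))
```

The toy deliberator mirrors the kind of correction shown in Figure 1 (replacing "two" with "a"), but only as a hand-written rule; the actual model learns such corrections from data.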

Paper Structure

This paper contains 23 sections, 16 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Two examples produced by our two-pass decoding based method. Given an image, the Drafting Model first generates a draft caption, and our CMA-DM then refines it to a better description in the polishing process. In the first example, our CMA-DM corrects "two" to "a" and avoids error accumulation. In the second example, our CMA-DM performs global planning based on the draft caption and generates a more readable descriptive sentence.
  • Figure 2: Multi-Head Attention and Cross Modification Attention. (a) Multi-head attention operates on a single set of key-value pairs and outputs the context vector by applying h parallel scaled dot-product attention modules. (b) CMA operates on both visual and linguistic features and generates modified context vectors using a gated operation and a residual connection.
  • Figure 3: Overview of our two-pass decoding framework for image captioning. It consists of two interrelated models: the Drafting Model adopts the conventional encoder-decoder framework and generates a draft caption from a given image in the first-pass decoding process; the Deliberation Model then performs the polishing process to refine the draft caption into a better image description. Note that the refining encoder is not a necessary structure for every single-pass decoding based model, so we add one when necessary to adapt the model to the architecture of our framework.
  • Figure 4: Main structure of our CMA-DM. It consists of two components: the deliberation encoder projects both visual and linguistic representations into a new feature space; the deliberation decoder takes the projected features as input and refines the draft caption to a better image description. Note that the basic encoder and the refining encoder are not illustrated in this figure for clarity.
  • Figure 5: We design five variants of the CMA module to figure out the best possible network structure for mutual correction. Note that in this group of comparisons, we do not add residual connections to any of these networks.
  • ...and 2 more figures
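
The Figure 2 caption contrasts attention over a single set of key-value pairs with CMA, which attends over both visual and linguistic features and combines them through a gate and a residual connection. The sketch below is a rough, parameter-free illustration of that idea in plain Python. The function `cross_modify` and its specific gating rule (each modality's context is scaled by a sigmoid of the other modality's context, then added back to itself) are assumptions made for illustration; the paper's actual CMA equations are not reproduced in this section and very likely use learned projections and multiple heads.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cross_modify(
    query: Vector, visual: List[Vector], linguistic: List[Vector]
) -> Tuple[Vector, Vector]:
    """HYPOTHETICAL gated cross modification with residual connections.

    Each modality's context vector is modulated by a gate computed from the
    other modality, then added back to the unmodified context (the residual).
    """
    ctx_visual = attend(query, visual, visual)
    ctx_linguistic = attend(query, linguistic, linguistic)
    gate_for_visual = [sigmoid(x) for x in ctx_linguistic]
    gate_for_linguistic = [sigmoid(x) for x in ctx_visual]
    mod_visual = [c + g * c for c, g in zip(ctx_visual, gate_for_visual)]
    mod_linguistic = [c + g * c for c, g in zip(ctx_linguistic, gate_for_linguistic)]
    return mod_visual, mod_linguistic


if __name__ == "__main__":
    query = [1.0, 0.0]
    visual_feats = [[1.0, 0.0], [0.0, 1.0]]
    word_feats = [[0.5, 0.5], [1.0, -1.0]]
    print(cross_modify(query, visual_feats, word_feats))
```

Note that the Figure 5 caption states the five CMA variants were compared without residual connections; dropping the `c +` term in `mod_visual` and `mod_linguistic` would give the corresponding gate-only form of this sketch.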