Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning

Longteng Guo, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, Hanqing Lu

TL;DR

The high inference latency of autoregressive image captioning motivates a non-autoregressive alternative. The authors formulate Non-Autoregressive Image Captioning (NAIC) as a cooperative multi-agent reinforcement learning problem and introduce Counterfactuals-Critical Multi-Agent Learning (CMAL), which uses an agent-specific counterfactual baseline to optimize a sentence-level reward. They further improve performance with unlabeled images via knowledge distillation, pretraining with cross-entropy (XE) and then training with CMAL on real captions. On MSCOCO, NAIC-CMAL achieves CIDEr scores comparable to state-of-the-art autoregressive models while decoding about 13.9x faster.

Abstract

Most image captioning models are autoregressive, i.e., they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. Recently, non-autoregressive decoding has been proposed in machine translation to speed up inference by generating all words in parallel. Typically, these models use the word-level cross-entropy loss to optimize each word independently. However, such a learning process fails to consider sentence-level consistency, resulting in inferior generation quality for these non-autoregressive models. In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL). CMAL formulates NAIC as a multi-agent reinforcement learning system where positions in the target sequence are viewed as agents that learn to cooperatively maximize a sentence-level reward. In addition, we propose to utilize massive unlabeled images to boost captioning performance. Extensive experiments on the MSCOCO image captioning benchmark show that our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
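
To make the multi-agent view concrete, the following is a minimal sketch of a per-position counterfactual baseline, not the authors' implementation. It assumes the baseline for a position is the expected reward when only that position's word is resampled from its own distribution while every other position stays fixed. The names `toy_reward` and `counterfactual_advantages` are hypothetical, and the toy reward merely stands in for a sentence-level metric such as CIDEr.

```python
from typing import Callable, Dict, List

# One word distribution per position; each position acts as an "agent".
Policy = List[Dict[str, float]]


def toy_reward(caption: List[str], reference: List[str]) -> float:
    """Hypothetical sentence-level reward: fraction of reference words covered.

    Stands in for a real metric such as CIDEr. A repeated word earns nothing extra.
    """
    return len(set(caption) & set(reference)) / len(set(reference))


def counterfactual_advantages(
    policy: Policy,
    sampled: List[str],
    reward_fn: Callable[[List[str]], float],
) -> List[float]:
    """Return one advantage per position.

    Advantage = reward of the sampled caption minus the expected reward when
    only this position's word is replaced (assumed form of the baseline).
    """
    full_reward = reward_fn(sampled)
    advantages = []
    for i, dist in enumerate(policy):
        baseline = 0.0
        for word, prob in dist.items():
            # Swap in an alternative word at position i; keep the rest fixed.
            counterfactual = sampled[:i] + [word] + sampled[i + 1:]
            baseline += prob * reward_fn(counterfactual)
        advantages.append(full_reward - baseline)
    return advantages


reference = ["a", "dog", "runs"]
policy = [
    {"a": 0.9, "dog": 0.1},
    {"dog": 0.5, "a": 0.5},
    {"runs": 0.8, "dog": 0.2},
]
sampled = ["a", "a", "runs"]  # position 1 repeats a word
advantages = counterfactual_advantages(
    policy, sampled, lambda caption: toy_reward(caption, reference)
)
```

In this toy setup the repeated word at position 1 receives the most negative advantage, because swapping it for "dog" would raise the sentence-level reward, while position 2 receives an advantage of zero because none of its alternatives changes the reward. This is the kind of per-position credit assignment that a single shared sentence-level reward cannot provide.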

Paper Structure

This paper contains 31 sections, 10 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Given an image, an autoregressive image captioning (AIC) model generates a caption word by word, while a Non-Autoregressive Image Captioning (NAIC) model outputs all words in parallel.
  • Figure 2: Illustration of our Transformer-based non-autoregressive image captioning model, which is composed of an encoder and a decoder. On the far right, we cast the non-autoregressive decoder in multi-agent reinforcement learning terminology.
  • Figure 3: Two examples of the generated captions. GT is a ground-truth caption. NAIC-XE and NAIC-CMAL are our NAIC model after XE and CMAL training, respectively. Repeated words are highlighted in gray.
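
The contrast drawn in Figure 1 can be illustrated with a small sketch. This is a toy under stated assumptions, not the paper's model: `autoregressive_decode` and `non_autoregressive_decode` are hypothetical helpers, and the lambdas below are stand-ins for a real captioner that would compute words from image features.

```python
from typing import Callable, List


def autoregressive_decode(
    next_word: Callable[[List[str]], str], max_len: int
) -> List[str]:
    """Generate word by word; each step must wait for all earlier steps."""
    caption: List[str] = []
    for _ in range(max_len):
        word = next_word(caption)  # conditions on the words generated so far
        if word == "<eos>":
            break
        caption.append(word)
    return caption


def non_autoregressive_decode(
    word_at: Callable[[int], str], length: int
) -> List[str]:
    """Each position picks its word independently of the others,
    so all positions could be computed in parallel."""
    return [word_at(i) for i in range(length)]


# Toy stand-in for an autoregressive model: it follows a fixed script.
script = ["a", "dog", "runs", "<eos>"]
ar_caption = autoregressive_decode(lambda prefix: script[len(prefix)], max_len=10)

# Toy stand-in for a non-autoregressive model: independent per-position
# choices, which can repeat a word because no position sees its neighbors.
per_position = ["a", "dog", "dog"]
nar_caption = non_autoregressive_decode(lambda i: per_position[i], length=3)
```

The autoregressive loop is inherently sequential, which is the source of the latency the paper targets. The non-autoregressive version has no dependency between positions, which allows parallel decoding but also permits repeated words of the kind highlighted in Figure 3.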