Multimodal Attention for Neural Machine Translation

Ozan Caglayan, Loïc Barrault, Fethi Bougares

TL;DR

The paper investigates multimodal attention for neural machine translation, where the model attends to both an image and its source-language description. It introduces a four-part multimodal NMT (MNMT) architecture (textual encoder, visual encoder, multimodal attention, and a CGRU decoder) with several attention variants and two fusion methods (SUM and CONCAT). Empirical results on Multi30K show that modality-dependent attention, especially with CONCAT fusion and a suitable choice of encoder/decoder dependencies, yields consistent gains over textual baselines and improves CIDEr-D substantially, with further boosts from a best-source-description strategy. The work highlights the value of modality-aware attention in cross-modal translation and offers guidance on effective fusion and attention schemes for vision-language MT.
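To make the two fusion methods named above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes dot-product scoring in place of a learned attention scorer, toy dimensions, random annotations, and a hypothetical projection matrix `proj` for the CONCAT case. All function and variable names are illustrative.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, annotations):
    """Dot-product attention (an assumed simplification of a learned scorer).

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(w * ann[d] for w, ann in zip(weights, annotations))
        for d in range(dim)
    ]
    return weights, context


def fuse_sum(c_txt, c_img):
    """SUM fusion: element-wise sum of the two modality contexts."""
    return [t + i for t, i in zip(c_txt, c_img)]


def fuse_concat(c_txt, c_img, proj):
    """CONCAT fusion: concatenate both contexts, then project back to `dim`.

    `proj` is a hypothetical (dim x 2*dim) matrix standing in for learned weights.
    """
    joined = c_txt + c_img  # list concatenation, length 2 * dim
    return [sum(w * x for w, x in zip(row, joined)) for row in proj]


random.seed(0)
dim = 4  # toy hidden size (assumption)

# Toy annotations: 5 source words and 9 image regions, all of size `dim`.
text_annotations = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
image_annotations = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(9)]
decoder_state = [random.uniform(-1, 1) for _ in range(dim)]
proj = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]

# A dedicated attention per modality, then one of the two fusion methods.
w_txt, c_txt = attend(decoder_state, text_annotations)
w_img, c_img = attend(decoder_state, image_annotations)
c_sum = fuse_sum(c_txt, c_img)
c_cat = fuse_concat(c_txt, c_img, proj)
```

In this sketch each modality gets its own attention distribution (`w_txt` over words, `w_img` over image regions), and only the resulting context vectors are fused. SUM needs both contexts to have the same size, whereas CONCAT lets the projection decide how to weight each modality.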

Abstract

The attention mechanism is an important part of neural machine translation (NMT), where it has been reported to produce richer source representations than fixed-length encoding sequence-to-sequence models. Recently, the effectiveness of attention has also been explored in the context of image captioning. In this work, we assess the feasibility of a multimodal attention mechanism that simultaneously focuses on an image and its natural language description to generate a description in another language. We train several variants of our proposed attention mechanism on the Multi30K multilingual image captioning dataset. We show that a dedicated attention for each modality achieves gains of up to 1.6 points in BLEU and METEOR over a textual NMT baseline.

Paper Structure

This paper contains 17 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The architecture of MNMT: The boxes with $*$ refer to a linear transformation while $\Phi(\Sigma)$ means a $\tanh$ applied over the sum of the inputs.
  • Figure 2: The conceptualization of multimodal attention in terms of different dependency schemes over the source modalities. Common parts from (A) to (C) and (B) to (D) are grayed out to emphasize the changes.
  • Figure 3: The impact of sharing the attention on the attention precision: (Left) Completely independent (shared) attention. (Right) Encoder-dependent attention with independent decoder state projection. (The description in English: a white bird landing in water)