Multimodal Attention for Neural Machine Translation
Ozan Caglayan, Loïc Barrault, Fethi Bougares
TL;DR
The paper investigates multimodal attention for neural machine translation by attending to both image and source-language text. It introduces a four-part MNMT architecture (textual encoder, visual encoder, multimodal attention, and a CGRU decoder) with multiple attention-variant configurations and two fusion methods (SUM and CONCAT). Empirical results on Multi30K show that modality-dependent attention, especially with CONCAT fusion and careful encoder/decoder dependencies, yields consistent gains over textual baselines and improves CIDEr-D substantially, with further boosts when using a best-source-description strategy. This work highlights the value of modality-aware attention in cross-modal translation and provides guidance on effective fusion and attention schemes for vision-language MT.
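The two fusion methods named above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the helper names (`attend`, `fuse`), the toy vectors, and the use of plain dot-product scoring are all hypothetical choices made for brevity. In a trained model the attention scores would come from learned parameters, and CONCAT would typically be followed by a learned projection back to the decoder dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, annotations):
    """Return a context vector: the attention-weighted sum of annotations.

    Dot-product scoring is an assumption made to keep the sketch short.
    """
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    return [sum(w * ann[i] for w, ann in zip(weights, annotations))
            for i in range(dim)]


def fuse(c_txt, c_img, method="concat"):
    """Combine the textual and visual context vectors."""
    if method == "sum":
        # SUM: element-wise addition; both contexts must share a dimension.
        if len(c_txt) != len(c_img):
            raise ValueError("SUM fusion needs equal-sized context vectors")
        return [t + v for t, v in zip(c_txt, c_img)]
    if method == "concat":
        # CONCAT: stack the two contexts; the result is twice as wide here.
        return c_txt + c_img
    raise ValueError(f"unknown fusion method: {method}")


# Toy inputs (hypothetical values): one decoder state, three source-word
# annotations, and two image-region feature vectors.
decoder_state = [0.5, -0.5]
text_annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_annotations = [[0.2, 0.4], [0.6, 0.8]]

# A dedicated attention per modality, each producing its own context vector.
c_txt = attend(decoder_state, text_annotations)
c_img = attend(decoder_state, image_annotations)

fused_sum = fuse(c_txt, c_img, "sum")
fused_concat = fuse(c_txt, c_img, "concat")
```

In this sketch SUM keeps the context the same size as each input, while CONCAT preserves both modalities side by side at the cost of a wider vector for the decoder to consume.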
Abstract
The attention mechanism is an important part of neural machine translation (NMT), where it has been reported to produce richer source representations than the fixed-length encodings of sequence-to-sequence models. Recently, the effectiveness of attention has also been explored in the context of image captioning. In this work, we assess the feasibility of a multimodal attention mechanism that simultaneously focuses on an image and its natural language description to generate a description in another language. We train several variants of our proposed attention mechanism on the Multi30K multilingual image captioning dataset. We show that a dedicated attention for each modality achieves gains of up to 1.6 points in BLEU and METEOR over a textual NMT baseline.
