Imagination improves Multimodal Translation

Desmond Elliott, Ákos Kádár

TL;DR

This paper tackles multimodal translation by decomposing it into learning to translate and learning visually grounded representations via a multitask framework called Imagination. A shared encoder serves both a neural machine translation decoder and an Imaginet image-prediction decoder, enabling grounding without using images as input during translation. The model leverages external data sources (described images and parallel text) to improve performance, achieving state-of-the-art Meteor scores on Multi30K and demonstrating gains with MS COCO and News Commentary data. The work shows that visually grounded source representations can be learned effectively through multitask learning and external resources, with robust improvements in translation quality and image-grounding evidence from ranking and feature-vector analyses.

Abstract

We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.
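The multitask setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the exact loss, so the sketch assumes a margin-based ranking loss over cosine similarity for the image prediction task and a weighted sum to combine the two task losses. It also assumes the predicted image vector is the mean of the shared encoder states, leaving out any learned projection layer. All function names and the `margin` and `weight` parameters are hypothetical.

```python
import math


def mean_pool(hidden_states):
    """Average equal-length encoder state vectors into a single vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def image_prediction_loss(predicted, target, contrastive, margin=0.1):
    """Hinge ranking loss (assumed form): the true image vector should be
    more similar to the prediction than each contrastive image vector,
    by at least `margin`."""
    positive = cosine(predicted, target)
    return sum(
        max(0.0, margin - positive + cosine(predicted, negative))
        for negative in contrastive
    )


def multitask_loss(translation_loss, image_loss, weight=0.5):
    """Combine the two task losses; `weight` is a hypothetical hyperparameter."""
    return weight * translation_loss + (1.0 - weight) * image_loss


# Toy example: two 2-dimensional encoder states for one source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
predicted_image = mean_pool(encoder_states)

# The prediction points away from the true image and towards the distractor,
# so the ranking loss is positive.
loss_v = image_prediction_loss(
    predicted_image, target=[1.0, -1.0], contrastive=[[1.0, 1.0]]
)
total = multitask_loss(translation_loss=2.0, image_loss=loss_v)
```

Because images enter only through the training loss, translating a new sentence in this sketch needs just the encoder and the translation decoder. This matches the summary's point that images are not used as input during translation.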

Paper Structure

This paper contains 19 sections, 13 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: The Imagination model learns visually-grounded representations by sharing the encoder network between the Translation Decoder and the imaginet Decoder, which performs image prediction.
  • Figure 2: We can interpret the imaginet Decoder by visualising the predictions made by our model.