Imagination improves Multimodal Translation
Desmond Elliott, Ákos Kádár
TL;DR
This paper tackles multimodal translation by decomposing it into learning to translate and learning visually grounded representations via a multitask framework called Imagination. A shared encoder serves both a neural machine translation decoder and an Imaginet image-prediction decoder, enabling grounding without using images as input during translation. The model leverages external data sources (described images and parallel text) to improve performance, achieving state-of-the-art Meteor scores on Multi30K and demonstrating gains with MS COCO and News Commentary data. The work shows that visually grounded source representations can be learned effectively through multitask learning and external resources, with robust improvements in translation quality and image-grounding evidence from ranking and feature-vector analyses.
Abstract
We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.
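To make the multitask setup concrete, the following is a minimal sketch of how a shared encoder's states might feed two objectives: a translation loss and an image representation prediction loss. Everything in it is an assumption made for illustration and is not specified in this section: the mean pooling of encoder states, the cosine-distance image loss, the interpolation weight `w`, and all function names are hypothetical.

```python
import math

def mean_pool(hidden_states):
    """Average a list of equal-length encoder state vectors into one vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / n for i in range(dim)]

def cosine_distance(u, v):
    """1 - cosine similarity; an assumed stand-in for the image prediction loss."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference tokens (translation loss)."""
    total = sum(math.log(p[t]) for p, t in zip(step_probs, target_ids))
    return -total / len(target_ids)

def multitask_loss(translation_loss, image_loss, w=0.5):
    """Interpolate the two task losses; w is a hypothetical mixing weight."""
    return w * translation_loss + (1.0 - w) * image_loss

# Toy example: two encoder states shared by both tasks.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
predicted_image_vector = mean_pool(encoder_states)   # image prediction from text only
true_image_vector = [1.0, 1.0]                       # stand-in for a CNN feature vector
decoder_probs = [[0.5, 0.5], [0.25, 0.75]]           # per-step target-word distributions
reference_ids = [0, 1]

loss = multitask_loss(
    cross_entropy(decoder_probs, reference_ids),
    cosine_distance(predicted_image_vector, true_image_vector),
    w=0.5,
)
```

Because the image vector appears only as a prediction target, the translation path in this sketch never takes an image as input, and the two losses could be computed on examples drawn from different datasets.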
