Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description

Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, Lucia Specia

TL;DR

This paper reports the results of the second WMT Shared Task on multimodal translation and multilingual image description, introducing French for Task 1 and a test-time image-only setup for Task 2. It analyzes 19 systems submitted by nine participating groups, showing that multimodal approaches often outperform text-only baselines on human judgments, while automatic metrics yield mixed rankings. External data sources consistently boost performance, underscoring the value of unconstrained training resources in small domain datasets. The study also introduces Ambiguous COCO to probe visual disambiguation and emphasizes the need for human evaluation to complement traditional metrics in multimodal multilingual tasks.

Abstract

We present the results from the second shared task on multimodal machine translation and multilingual image description. Nine teams submitted 19 systems to two tasks. The multimodal translation task, in which the source sentence is supplemented by an image, was extended with a new language (French) and two new test sets. The multilingual image description task was changed such that at test time, only the image is given. Compared to last year, multimodal systems improved, but text-only systems remain competitive.

Paper Structure

This paper contains 34 sections, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Example of an image with a source description in English, together with German and French translations.
  • Figure 2: Two senses of the English verb "to pass" in their visual contexts, with the original English and the translations into German and French. The verb and its translations are underlined.
  • Figure 3: Example of the human direct assessment evaluation interface.
  • Figure 4: System performance on the English$\rightarrow$German Multi30K 2017 test data as measured by human evaluation against Meteor scores. The AFRL-OHIOSTATE-MULTIMODAL_U system has been omitted for readability.
  • Figure 5: System performance on the English$\rightarrow$French Multi30K 2017 test data as measured by human evaluation against Meteor scores.
  • ...and 1 more figure