Emergent Translation in Multi-Agent Communication

Jason Lee, Kyunghyun Cho, Jason Weston, Douwe Kiela

TL;DR

The paper demonstrates that translation between languages can emerge from a vision-grounded, two-agent communication game that requires no parallel corpora. By grounding both languages in a shared visual modality and jointly training speaker and listener modules, agents learn to translate as a byproduct of solving a referential task. The evaluation covers word- and sentence-level translation, where the model outperforms several baselines, as well as zero- and low-resource scenarios (including Klingon) and a multilingual community setting in which agents learn to translate better and faster than in the bilingual setting. Overall, the results suggest that grounding language in perception and interactive communication is a viable route to multilingual translation, with potential extensions to abstract language and multi-task learning.

Abstract

While most machine translation systems to date are trained on large parallel corpora, humans learn language in a different way: by being grounded in an environment and interacting with other humans. In this work, we propose a communication game where two agents, native speakers of their own respective languages, jointly learn to solve a visual referential task. We find that the ability to understand and translate a foreign language emerges as a means to achieve shared goals. The emergent translation is interactive and multimodal, and crucially does not require parallel corpora, but only monolingual, independent text and corresponding images. Our proposed translation model achieves this by grounding the source and target languages into a shared visual modality, and outperforms several baselines on both word-level and sentence-level translation tasks. Furthermore, we show that agents in a multilingual community learn to translate better and faster than in a bilingual communication setting.
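
As a rough illustration of what grounding two languages in a shared space can buy, the sketch below translates a word by nearest-neighbor lookup. It is a toy under stated assumptions, not the paper's model: the vectors are hand-written stand-ins for learned representations, and the names `cosine`, `translate`, `en_space`, and `de_space` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def translate(source_vector, target_space):
    """Return the target word whose vector is closest to source_vector."""
    return max(target_space, key=lambda word: cosine(source_vector, target_space[word]))

# Assumption: both agents have already mapped their words into one shared
# (image-aligned) space. These 3-d vectors are invented for illustration.
en_space = {"dog": [0.9, 0.1, 0.0], "cat": [0.1, 0.9, 0.1]}
de_space = {
    "Hund": [0.8, 0.2, 0.1],
    "Katze": [0.2, 0.8, 0.0],
    "Pferd": [0.0, 0.1, 0.9],
}

for word, vector in en_space.items():
    print(word, "->", translate(vector, de_space))
```

No English-German word pair appears anywhere in the toy data; the pairing arises only because both vocabularies sit in the same space, which is the intuition behind training without parallel corpora.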

Paper Structure

This paper contains 41 sections, 4 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Sentence-level communication task and translation between English and Japanese. (a) The red dotted line delimits the agents and the gray dotted line delimits the communication tasks for different languages. Representations residing in the multimodal spaces of Agents A and B are shown in green and yellow, respectively. (b) An illustration of how the Japanese agent might translate an unseen English sentence to Japanese.
  • Figure 2: Word-level translation results, in precision at $k$. Results are averaged over 30 translation cases (15 two-way pairs).
  • Figure 3: Learning curve for the EN-DE word-level model.
  • Figure 4: Training data in single-pair and community models. M1_EN denotes the English annotations for the first half of the images in Multi30k. Red and blue indicate training data for the English and the German agents' speaker modules, respectively. Note that, compared to the single-pair model, the English and German speakers see twice as much training data in the full model but the same number of examples in the fair model.
  • Figure 5: DE-EN learning curve for different models.
  • ...and 9 more figures
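
The precision at $k$ metric named in the Figure 2 caption can be sketched as follows. This is a minimal version assuming the common definition for word translation, where a source word counts as a hit if any gold translation appears among its top $k$ ranked candidates. The function name `precision_at_k` and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def precision_at_k(ranked_candidates, gold, k):
    """Fraction of source words with a gold translation in their top-k candidates.

    ranked_candidates: dict mapping source word -> candidates, best first
    gold: dict mapping source word -> set of acceptable translations
    """
    hits = 0
    for source_word, candidates in ranked_candidates.items():
        # Count a hit if any of the first k candidates is a gold translation.
        if any(c in gold[source_word] for c in candidates[:k]):
            hits += 1
    return hits / len(ranked_candidates)

# Toy rankings, invented for illustration only.
ranked = {
    "dog": ["Hund", "Katze", "Pferd"],
    "cat": ["Hund", "Katze", "Maus"],
    "horse": ["Kuh", "Maus", "Hund"],
}
gold = {"dog": {"Hund"}, "cat": {"Katze"}, "horse": {"Pferd"}}

for k in (1, 2, 3):
    print(f"P@{k} = {precision_at_k(ranked, gold, k):.2f}")
```

In the toy data, "dog" is correct at rank 1, "cat" at rank 2, and "horse" is never recovered, so the score rises from $k=1$ to $k=2$ and then plateaus.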