Emergent Translation in Multi-Agent Communication
Jason Lee, Kyunghyun Cho, Jason Weston, Douwe Kiela
TL;DR
The paper shows that translation between languages can emerge from a vision-grounded, two-agent communication game trained without parallel corpora. Both languages are grounded in a shared visual modality, and the speaker and listener modules are trained jointly, so the agents learn to translate as a byproduct of solving a referential task. Evaluations cover word- and sentence-level translation, where the model outperforms several baselines; zero- and low-resource scenarios (including Klingon); and a multilingual community setting, in which agents learn to translate faster and better than in the bilingual setting. Overall, the work offers evidence that grounding language in perception and interactive communication is a viable route to multilingual translation, with possible extensions to abstract language and multi-task learning.
Abstract
While most machine translation systems to date are trained on large parallel corpora, humans learn language in a different way: by being grounded in an environment and interacting with other humans. In this work, we propose a communication game where two agents, native speakers of their own respective languages, jointly learn to solve a visual referential task. We find that the ability to understand and translate a foreign language emerges as a means to achieve shared goals. The emergent translation is interactive and multimodal, and crucially does not require parallel corpora, but only monolingual, independent text and corresponding images. Our proposed translation model achieves this by grounding the source and target languages into a shared visual modality, and outperforms several baselines on both word-level and sentence-level translation tasks. Furthermore, we show that agents in a multilingual community learn to translate better and faster than in a bilingual communication setting.
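To make the visual referential task concrete, the following is a minimal sketch of a single round of such a game, not the authors' implementation. It assumes a toy setup: images are small hand-written feature vectors, the speaker is a fixed lookup table from image to a description in its own language, and the listener embeds the message as the mean of word vectors that live in the same space as the image features. All names (`play_round`, `listener_pick`, `SPEAKER_LEXICON`, and so on) and all data values are hypothetical, and nothing is learned here; the sketch only shows how a message in one agent's language can be scored against candidate images in a shared visual space.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed_message(message, word_vectors, dim):
    """Toy listener encoder (an assumption): mean of the known word vectors."""
    known = [word_vectors[w] for w in message if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def listener_pick(message, word_vectors, candidates):
    """Return the index of the candidate image most similar to the message."""
    query = embed_message(message, word_vectors, len(candidates[0]))
    scores = [cosine(query, image) for image in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


def play_round(target_idx, candidates, speaker_lexicon, listener_word_vectors):
    """One round: the speaker describes the target, the listener guesses it.

    Returns 1 if the listener picks the target image, else 0.
    """
    message = speaker_lexicon[target_idx]  # speaker's description of the target
    guess = listener_pick(message, listener_word_vectors, candidates)
    return 1 if guess == target_idx else 0


# Hypothetical toy data: two image feature vectors, a German-speaking
# speaker, and a listener whose word vectors share the image feature space.
IMAGES = [[1.0, 0.0], [0.0, 1.0]]
SPEAKER_LEXICON = {0: ["hund"], 1: ["katze"]}
LISTENER_WORD_VECTORS = {"hund": [0.9, 0.1], "katze": [0.2, 0.8]}

if __name__ == "__main__":
    for target in range(len(IMAGES)):
        reward = play_round(target, IMAGES, SPEAKER_LEXICON, LISTENER_WORD_VECTORS)
        print(f"target={target} reward={reward}")
```

In a trained system, the success signal returned by `play_round` would be used to update both agents; in this sketch the word vectors are fixed by hand, so it illustrates the scoring step only.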
