Emergent Communication in a Multi-Modal, Multi-Step Referential Game
Katrina Evtimova, Andrew Drozdov, Douwe Kiela, Kyunghyun Cho
TL;DR
This work introduces a multi-modal, multi-step referential game in which a sender that observes an image and a receiver that observes textual descriptions exchange bidirectional, high-bandwidth messages over variable-length dialogues of up to $T_{\max}=10$ steps. Both agents draw their messages from a shared vocabulary, $S=\{0,1\}^d$, which allows a more language-like communication to emerge between agents with distinct modalities. The agents are trained via policy gradient. Four neural architectures are analyzed (a feedforward or attention-based sender paired with a recurrent or attention-based receiver), with attention improving cross-domain generalization and higher message bandwidth enhancing zero-shot transfer. The study finds that dialogue length correlates with task difficulty, that the receiver’s confidence grows over time, and that message entropy shifts as communication progresses, suggesting a progressive refinement akin to natural dialogue. Limitations include incomplete symmetry between the agents, a fixed binary syntax, and the lack of embodied actions, pointing to future work on richer linguistic structure and action-enabled multi-agent systems.
Abstract
Inspired by previous work on emergent communication in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal communication significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.
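To make the protocol concrete, the following is a minimal sketch of one dialogue in such a game. It is not the authors' implementation. It assumes toy, hand-written agents in place of the neural sender and receiver, and it performs no learning (no policy gradient). The names `toy_sender`, `toy_receiver`, and `play_game`, the message width `D = 8`, and the stopping `margin` are all hypothetical. The sketch only illustrates three properties described above: both agents emit messages from the same set $\{0,1\}^d$, the exchange is bidirectional, and the receiver decides when to stop, subject to a cap of $T_{\max}$ steps.

```python
import random

T_MAX = 10  # maximum number of dialogue steps, as stated in the summary
D = 8       # message width d; hypothetical value chosen for illustration


def sample_message(probs, rng):
    """Draw one message from {0,1}^d using independent Bernoulli draws."""
    return tuple(1 if rng.random() < p else 0 for p in probs)


def toy_sender(image_bits, receiver_msg, rng):
    """Hypothetical sender: mixes its image features with the last reply."""
    # Probabilities stay within [0.05, 0.95] because all inputs are 0 or 1.
    probs = [0.8 * f + 0.1 * m + 0.05 for f, m in zip(image_bits, receiver_msg)]
    return sample_message(probs, rng)


def toy_receiver(text_bits, sender_msg, scores, margin=6):
    """Hypothetical receiver: accumulates evidence, then decides whether to stop."""
    for k, bits in enumerate(text_bits):
        scores[k] += sum(1 for a, b in zip(bits, sender_msg) if a == b)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    best = ranked[0]
    lead = scores[best] - scores[ranked[1]] if len(ranked) > 1 else margin
    stop = lead >= margin           # stop once one candidate clearly leads
    reply = tuple(text_bits[best])  # the reply is also a message in {0,1}^d
    return stop, reply, best


def play_game(image_bits, text_bits, rng, t_max=T_MAX):
    """Run one dialogue; return (prediction, steps used, transcript)."""
    receiver_msg = (0,) * len(image_bits)  # empty initial message
    scores = [0] * len(text_bits)
    transcript = []
    prediction = 0
    for step in range(1, t_max + 1):
        sender_msg = toy_sender(image_bits, receiver_msg, rng)
        stop, receiver_msg, prediction = toy_receiver(text_bits, sender_msg, scores)
        transcript.append((sender_msg, receiver_msg))
        if stop:
            break
    return prediction, step, transcript


# Usage: four candidate descriptions; the image is simplified to share the
# target's bit pattern (an assumption, real modalities would differ).
rng = random.Random(0)
candidates = [tuple(rng.randint(0, 1) for _ in range(D)) for _ in range(4)]
target = 2
prediction, steps, transcript = play_game(candidates[target], candidates, rng)
```

In this toy version, harder instances (candidates whose bit patterns are close to the target's) tend to need more steps before the margin is reached, which loosely mirrors the reported link between dialogue length and task difficulty.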
