Emergent Communication in a Multi-Modal, Multi-Step Referential Game

Katrina Evtimova, Andrew Drozdov, Douwe Kiela, Kyunghyun Cho

TL;DR

This work introduces a multi-modal, multi-step referential game in which a sender that observes images and a receiver that reads textual descriptions exchange bidirectional, high-bandwidth messages over variable-length dialogues of up to $T_{\max}=10$ steps. Both agents draw their messages from a shared binary vocabulary, $S=\{0,1\}^d$, which makes the emergent communication between agents with distinct modalities more language-like. The agents are trained via policy gradient. Four neural architectures (feedforward or attention-based sender, recurrent or attention-based receiver) are analyzed; attention improves cross-domain generalization, and higher message bandwidth enhances zero-shot transfer. The study finds that longer dialogues correlate with task difficulty, that the receiver's confidence grows over time, and that message entropy shifts as communication progresses, suggesting a progressive refinement akin to natural dialogue. Limitations include incomplete symmetry between the agents, a fixed binary syntax, and a lack of embodied actions, pointing to future work on richer linguistic structure and action-enabled multi-agent systems.

Abstract

Inspired by previous work on emergent communication in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal communication significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.
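The multi-step exchange described above can be illustrated with a minimal sketch. It is not the authors' implementation: the agents here are untrained toy functions, the message size `D = 8` is an arbitrary choice, and the names `toy_sender`, `toy_receiver`, and `play_dialogue` are hypothetical. Only the binary message space $\{0,1\}^d$, the bidirectional turn-taking, the receiver-controlled stopping, and the cap of $T_{\max}=10$ steps are taken from the summary. Policy-gradient training and the receiver's final prediction are omitted.

```python
import math
import random

T_MAX = 10  # maximum dialogue length, as stated in the TL;DR
D = 8       # message dimensionality d (hypothetical value for illustration)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_bits(logits, rng):
    """Sample a message in {0,1}^d from independent Bernoulli units."""
    return [1 if rng.random() < sigmoid(z) else 0 for z in logits]

def toy_sender(image_feat, receiver_msg, rng):
    # Hypothetical sender: combines its image features with the
    # receiver's last message (bits mapped to -1/+1) to form logits.
    logits = [f + (2 * m - 1) for f, m in zip(image_feat, receiver_msg)]
    return sample_bits(logits, rng)

def toy_receiver(state, sender_msg, rng):
    # Hypothetical receiver: accumulates the sender's bits into a state,
    # then samples a stop decision and a reply message.
    state = [s + (2 * m - 1) for s, m in zip(state, sender_msg)]
    confidence = sigmoid(sum(abs(s) for s in state) / len(state) - 2.0)
    stop = rng.random() < confidence
    reply = sample_bits(state, rng)
    return state, stop, reply

def play_dialogue(image_feat, rng, t_max=T_MAX):
    """Run one conversation; it ends when the receiver stops or at t_max."""
    receiver_msg = [0] * len(image_feat)
    state = [0.0] * len(image_feat)
    transcript = []
    for _ in range(t_max):
        sender_msg = toy_sender(image_feat, receiver_msg, rng)
        state, stop, receiver_msg = toy_receiver(state, sender_msg, rng)
        transcript.append((sender_msg, receiver_msg))
        if stop:
            break
    return transcript

rng = random.Random(0)
image_feat = [rng.gauss(0.0, 1.0) for _ in range(D)]
transcript = play_dialogue(image_feat, rng)
print(f"conversation ended after {len(transcript)} step(s)")
```

Because the receiver decides when to stop, the conversation length varies from one input to the next, which is the property the paper relates to task difficulty.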

Paper Structure

This paper contains 32 sections, 13 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Visualizing a sender-receiver exchange at time step $t$. See Sec. \ref{sec:game} and \ref{sec:agents} for more details.
  • Figure 2: (a) Per-class F1 (used as an inverse measure of difficulty) versus conversation length. A negative correlation is observed, implying that more difficult classes (lower F1) require more turns. (b) Accuracy@$K$ versus conversation length for the in-domain (blue) and out-of-domain (red) test sets.
  • Figure 3: (a) Prediction entropy over the conversation using the in-domain (blue) and out-of-domain (red) test sets. (b, c) Prediction certainty over time in example conversations about Kangaroo and Wolf, respectively.
  • Figure 4: Message entropy over the conversation on the in-domain test set of the sender (left) and receiver (right).
  • Figure 5: Accuracy@$K$ on the In-Domain ($K=6$) and Out-of-Domain ($K=7$) test sets for the Adaptive models of varying message size. We notice the increasing accuracy on the out-of-domain test set as the bandwidth of the channel increases.
  • ...and 1 more figure
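
The captions above refer to prediction entropy and Accuracy@$K$. As a rough illustration, the sketch below uses the standard definitions: Shannon entropy of a predicted class distribution, and the fraction of examples whose true class is among the $K$ highest-scoring classes. The paper's exact computation may differ, and the function names are hypothetical.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def accuracy_at_k(score_lists, targets, k):
    """Fraction of examples whose target is among the k top-scoring classes."""
    hits = 0
    for scores, target in zip(score_lists, targets):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)

# A uniform prediction is maximally uncertain; a peaked one has lower entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(prediction_entropy(uniform) > prediction_entropy(peaked))
```

Under these definitions, a receiver whose confidence grows over a conversation would show prediction entropy decreasing from step to step.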