A Combinatorial Approach to Neural Emergent Communication
Zheyuan Zhang
TL;DR
This work analyzes why emergent communication in Lewis signaling games often relies on a small symbol set, attributing this to sampling pitfalls in training data. It introduces the SolveMinSym (SMS) algorithm to compute the symbolic complexity $\nabla\min(|M|)$ by examining attribute combinations that uniquely identify a target image among distractors, and demonstrates, via synthetic attribute–value datasets, that higher symbolic complexity in data leads to longer effective emergent languages. The experiments with GRU-based sender/receiver models and Gumbel-Softmax optimization show that data with $\nabla\min(|M|)=3$ can yield substantially larger gains when increasing maximum message length $L$, supporting the claim that data design can drive language compositionality. Overall, the paper highlights a data-centric path to fostering longer emergent languages without changing model architectures, with implications for understanding and guiding communication protocols in multi-agent systems.
Abstract
Substantial research on deep learning-based emergent communication uses the referential game framework, specifically the Lewis signaling game. However, we argue that successful communication in this game typically needs only one or two symbols for target image classification because of a sampling pitfall in the training data. To address this issue, we provide a theoretical analysis and introduce a combinatorial algorithm, SolveMinSym (SMS), to compute the symbolic complexity for classification, which is the minimum number of symbols a message needs for successful communication. We use the SMS algorithm to create datasets with different symbolic complexity and show empirically that data with higher symbolic complexity increases the number of effective symbols in the emergent language.
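To make the notion of symbolic complexity concrete, the following is a minimal brute-force sketch, not the SMS algorithm itself. It assumes each image is a tuple of discrete attribute values and that one symbol can convey one attribute value; under those assumptions, the complexity of a sample is the size of the smallest attribute subset on which the target differs from every distractor. The function name `min_distinguishing_attributes` and the tuple encoding are hypothetical choices made for illustration.

```python
from itertools import combinations


def min_distinguishing_attributes(target, distractors):
    """Size of the smallest attribute subset that separates target from all distractors.

    Illustrative brute force (an assumption, not the paper's SMS procedure):
    a subset separates the target if every distractor differs from the target
    on at least one attribute in the subset. Returns None when no subset works,
    e.g. when a distractor is identical to the target.
    """
    n = len(target)
    # Try subset sizes in increasing order so the first hit is the minimum.
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            if all(any(target[i] != d[i] for i in subset) for d in distractors):
                return k
    return None


# One attribute suffices: the single distractor differs on attribute 0.
easy = min_distinguishing_attributes((0, 0, 0), [(1, 0, 0)])

# Each distractor shares two of the three attribute values with the target,
# so every attribute has to be checked.
hard = min_distinguishing_attributes((0, 0, 0), [(0, 0, 1), (0, 1, 0), (1, 0, 0)])
```

In this toy setting, the first sample can be solved by checking a single attribute, while the second requires all three. This mirrors the argument that how distractors are sampled determines how many symbols a message must carry.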
