Concept-Best-Matching: Evaluating Compositionality in Emergent Communication

Boaz Carmeli; Yonatan Belinkov; Ron Meir

TL;DR

This work tackles the challenge of evaluating compositionality in emergent communication (EC) by introducing Concept Best Matching (CBM), which constructs a weighted bipartite graph between EC words and natural-language (NL) concepts and finds the optimal one-to-one mapping via the Hungarian algorithm. The resulting CBM score, normalized by $Q=\sum_{i\in D}\max(|m_i|,|l_i|)$, yields a global measure of compositionality and an interpretable translation between words and concepts. Experiments on the Shape and Thing datasets with GS and QT communication show that CBM aligns with task accuracy and exposes sub-phenomena such as ambiguities and paraphrases, offering finer-grained insights than the traditional TopSim and AMI metrics. The results suggest that, while QT tends to perform better than GS, none of the setups achieves the level of compositionality seen in natural language, highlighting the gap between emergent protocols and human-like symbolic language. CBM provides a practical, interpretable diagnostic tool for analyzing EC systems and steering their development toward more compositional and human-aligned communication.
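The matching step summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the edge weight is assumed here to be the number of samples in which a word and a concept co-occur, the score is assumed to be the total matched weight divided by $Q$, and the names `hungarian_min` and `cbm_score` are hypothetical. The input reproduces the two turns from the Figure 1 caption.

```python
from collections import Counter


def hungarian_min(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Standard potentials-based Hungarian algorithm; returns (row, col) pairs.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials
    v = [0] * (m + 1)      # column potentials
    p = [0] * (m + 1)      # p[j] = row (1-based) assigned to column j
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def cbm_score(samples):
    """samples: list of (message_words, nl_concepts) pairs, one per turn.

    Assumption: edge weight = word/concept co-occurrence count.
    Returns (score, {word: concept}).
    """
    counts = Counter()
    q = 0
    for words, concepts in samples:
        q += max(len(words), len(concepts))  # Q = sum_i max(|m_i|, |l_i|)
        for w in words:
            for c in concepts:
                counts[(w, c)] += 1
    words = sorted({w for w, _ in counts})
    concepts = sorted({c for _, c in counts})
    if not words or not concepts:
        return 0.0, {}
    # The solver needs rows <= columns, so put the smaller side on the rows.
    transposed = len(words) > len(concepts)
    rows, cols = (concepts, words) if transposed else (words, concepts)

    def weight(r, c):
        return counts[(c, r)] if transposed else counts[(r, c)]

    # Negate weights: maximum-weight matching becomes minimum-cost assignment.
    cost = [[-weight(r, c) for c in cols] for r in rows]
    mapping, total = {}, 0
    for i, j in hungarian_min(cost):
        w, c = (cols[j], rows[i]) if transposed else (rows[i], cols[j])
        if counts[(w, c)] > 0:  # skip pairs that never co-occur
            mapping[w] = c
            total += counts[(w, c)]
    return total / q, mapping


# The two turns described in the Figure 1 caption.
figure1_turns = [
    (["w1", "w2"], ["blue", "triangle"]),
    (["w2", "w3"], ["red", "triangle"]),
]
score, mapping = cbm_score(figure1_turns)
print(score, mapping)
```

Under these assumptions the sketch maps `w1` to blue, `w2` to triangle and `w3` to red, giving a score of $1.0$, which agrees with the value stated in the Figure 1 caption. When words outnumber concepts (or the reverse), some nodes stay unmatched and the score falls below $1.0$.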

Abstract

Artificial agents that learn to communicate in order to accomplish a given task acquire communication protocols that are typically opaque to a human. A large body of work has attempted to evaluate the emergent communication via various evaluation measures, with compositionality featuring as a prominent desired trait. However, current evaluation procedures do not directly expose the compositionality of the emergent communication. We propose a procedure to assess the compositionality of emergent communication by finding the best match between emerged words and natural language concepts. The best-match algorithm provides both a global score and a translation map from emergent words to natural language concepts. To the best of our knowledge, this is the first time that such a direct and interpretable mapping between emergent words and human concepts has been provided.

Paper Structure (24 sections, 3 equations, 4 figures, 4 tables)

Introduction
Background
Emergent Communication Setup
Compositionality in EC
Compositionality Evaluations in EC
Concept Best Matching
Best Match Algorithm
Experimental Setup
Datasets
Communication Channel
Results
Example Match
Conclusion
Limitations
Details on Evaluation Measures
...and 9 more sections

Figures (4)

Figure 1: The multi-target shape game. (a) At each turn, the sender is given a set of images, a subset of them marked as targets by an Oracle. (b) Sender generates messages $[w_1, w_2]$ for the blue-triangle (turn 1) and $[w_2, w_3]$ for the red-triangle (turn 2). (c) During evaluation, we construct a bipartite graph of words generated by the sender and concepts provided by the Oracle for each turn. (d) The best-match algorithm matches EC words to NL concepts and provides the CBM score. In this example, all EC words are matched with NL concepts, resulting in a CBM score of $1.0$.
Figure 2: The $\text{word} \leftrightarrow \text{FVP}$ best-match graph for the Shape game (GS communication, $l=1$). The algorithm matched just $10$ concepts to EC words out of $17$ possible concepts.
Figure 3: The Shape dataset, presenting one turn. Top $8$ images are Yellow targets. Bottom $8$ images are distractors.
Figure 4: Assessing CBM sensitivity to the size of the evaluation set. We show results for two randomly selected experiments. As seen, the CBM score stabilizes when assessing datasets comprising more than $500$ samples.
