Searching for the Most Human-like Emergent Language
Brendon Boldt, David Mortensen
TL;DR
The study tackles the challenge of making emergent languages resemble human language by optimizing a signalling-game environment against XferBench, a benchmark that scores an emergent language by how well it supports transfer learning to human language. It demonstrates that Bayesian hyperparameter search can produce emergent languages that outperform existing corpora in deep transfer to human language, and it shows that entropy is predictive of transfer performance, including an entropy-based Pareto frontier. Key findings include that large vocabularies (around 10k tokens), increased model capacity, and longer, information-rich messages improve realism, and that entropy acts as both a driver of and a bound on transfer performance. The work provides practical hyperparameter recommendations and emphasizes entropy minimization as an emergent property, offering a principled path toward realistic synthetic language data for NLP pretraining and evaluation.
Abstract
In this paper, we design a signalling-game-based emergent communication environment to generate emergent languages that are state-of-the-art in terms of similarity to human language. We do this through hyperparameter optimization, using XferBench as the objective function. XferBench quantifies the statistical similarity of emergent language to human language by measuring its suitability for deep transfer learning to human language. Additionally, we demonstrate the predictive power of entropy for the transfer learning performance of emergent language and corroborate previous results on the entropy-minimization properties of emergent communication systems. Finally, we report generalizations regarding which hyperparameters produce more realistic emergent languages, that is, ones which transfer better to human language.
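Since entropy plays a central role in the analysis, a minimal sketch of one way to measure it may help. The version below assumes unigram (token-level) Shannon entropy in bits over a corpus of integer-token messages; the abstract does not specify the exact entropy definition used, and the function name `unigram_entropy` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def unigram_entropy(messages: Iterable[Sequence[int]]) -> float:
    """Shannon entropy (in bits) of the token distribution of a corpus.

    Assumption: tokens are pooled across all messages and treated as
    independent draws, i.e. this is a unigram entropy, not a
    per-message or conditional entropy.
    """
    # Count every token occurrence across all messages.
    counts = Counter(tok for msg in messages for tok in msg)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # An empty corpus carries no information.
    # H = -sum_i p_i * log2(p_i), with p_i the relative token frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy corpus: four token types, each equally frequent.
toy_corpus = [[0, 1, 2, 3], [0, 1, 2, 3]]
toy_entropy = unigram_entropy(toy_corpus)
```

With four equally frequent token types the result is 2 bits, the maximum for a vocabulary of that size. A corpus that reuses a single token has zero entropy, which illustrates the low-entropy end of the range discussed above.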
