ELCC: the Emergent Language Corpus Collection
Brendon Boldt, David Mortensen
TL;DR
ELCC addresses the lack of representative emergent-language corpora by introducing a curated collection of 73 corpora from 7 emergent communication systems (ECSs), each with rich metadata, corpus data in JSONL, and a standardized suite of analyses. It pairs reproducible code with documentation to lower barriers to cross-system analysis, enabling large-scale comparisons and transfer-learning evaluations via XferBench. The resource demonstrates the feasibility and value of broad emergent-language analyses, highlights findings on entropy and transfer performance, and discusses design improvements and reproducibility challenges. Overall, ELCC serves as a foundational hub for comparative emergent-communication research, facilitating scalable, linguistically informed investigations and future community contributions.
Abstract
We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora generated from open-source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex environments like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length, performance as transfer learning data). Currently, research studying emergent languages requires directly running different systems, which takes time away from actual analysis of such languages, makes studies that compare diverse emergent languages rare, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial collection of well-documented emergent language corpora will therefore enable research that analyzes a wider variety of emergent languages and so more effectively uncovers general principles in emergent communication rather than artifacts of particular environments. We provide some quantitative and qualitative analyses with ELCC to demonstrate potential use cases of the resource in this vein.
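To make the corpus-level analyses named above concrete, the following is a minimal sketch of how size, unigram entropy, and average message length could be computed from a JSONL corpus. It assumes, for illustration only, that each line of the file is a JSON array of integer token IDs representing one message; the function names `load_messages` and `corpus_stats` are hypothetical and are not part of the ELCC codebase.

```python
import json
import math
from collections import Counter
from typing import Iterable


def load_messages(lines: Iterable[str]) -> list[list[int]]:
    """Parse JSONL lines into messages, skipping blank lines.

    Assumption: each non-blank line is a JSON array of integer token IDs.
    """
    return [json.loads(line) for line in lines if line.strip()]


def corpus_stats(messages: list[list[int]]) -> dict[str, float]:
    """Compute simple corpus statistics over a list of messages."""
    # Count every token occurrence across all messages.
    token_counts = Counter(tok for msg in messages for tok in msg)
    total_tokens = sum(token_counts.values())

    # Unigram (token-level) Shannon entropy in bits.
    entropy = 0.0
    for count in token_counts.values():
        p = count / total_tokens
        entropy -= p * math.log2(p)

    avg_len = total_tokens / len(messages) if messages else 0.0
    return {
        "message_count": len(messages),
        "token_count": total_tokens,
        "unique_tokens": len(token_counts),
        "unigram_entropy_bits": entropy,
        "mean_message_length": avg_len,
    }


if __name__ == "__main__":
    # Toy in-memory corpus standing in for a real JSONL file.
    sample_lines = ["[1, 2]", "[1, 2, 1, 2]", ""]
    print(corpus_stats(load_messages(sample_lines)))
```

With a real corpus, the same functions could be applied to an open file handle (for example, `load_messages(open(path))`, where `path` is a placeholder for the location of a corpus file), since iterating over a file yields one line at a time.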
