
ELCC: the Emergent Language Corpus Collection

Brendon Boldt, David Mortensen

TL;DR

ELCC addresses the lack of representative emergent-language corpora by introducing a curated collection of 73 corpora from 7 emergent communication systems (ECSs), each with rich metadata, corpus data in JSONL, and a standardized suite of analyses. It combines reproducible code and documentation to lower barriers to cross-system analysis, enabling large-scale comparisons and transfer-learning evaluations via XferBench. The resource demonstrates the feasibility and value of broad emergent-language analyses, reports findings on entropy and transfer performance, and discusses design improvements and reproducibility challenges. Overall, ELCC serves as a hub for comparative emergent-communication research, supporting scalable, linguistically informed investigations and future community contributions.

Abstract

We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora generated from open-source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex environments like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length, performance as transfer learning data). Currently, research studying emergent languages requires directly running different systems, which takes time away from actual analyses of such languages, makes studies that compare diverse emergent languages rare, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial collection of well-documented emergent language corpora, then, will enable research that analyzes a wider variety of emergent languages and thereby more effectively uncovers general principles in emergent communication rather than artifacts of particular environments. We provide some quantitative and qualitative analyses with ELCC to demonstrate potential use cases of the resource in this vein.
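
As an illustration of the kind of corpus-level analyses mentioned above (size, entropy, average message length), the following is a minimal sketch rather than ELCC's own analysis code. It assumes a JSONL corpus in which each line is a JSON array of integer token IDs; that layout, the function names `load_corpus` and `corpus_stats`, and the inline sample data are all assumptions made for illustration.

```python
import json
import math
from collections import Counter
from typing import Iterable


def load_corpus(lines: Iterable[str]) -> list[list[int]]:
    """Parse JSONL text: one JSON array of integer token IDs per non-empty line.

    The one-array-per-line layout is an assumption, not a documented ELCC format.
    """
    return [json.loads(line) for line in lines if line.strip()]


def corpus_stats(messages: list[list[int]]) -> dict[str, float]:
    """Compute message count, token count, average message length, and unigram entropy (bits)."""
    counts = Counter(tok for msg in messages for tok in msg)
    total = sum(counts.values())
    # Shannon entropy of the unigram token distribution, in bits.
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log2(p)
    return {
        "message_count": len(messages),
        "token_count": total,
        "average_message_length": total / len(messages) if messages else 0.0,
        "unigram_entropy_bits": entropy,
    }


# Hypothetical three-message corpus standing in for a real corpus file.
sample = "[1, 2, 2, 3]\n[2, 2]\n[1, 3]\n"
stats = corpus_stats(load_corpus(sample.splitlines()))
print(stats)
```

For a corpus stored on disk, the same functions could be applied by passing an open file handle to `load_corpus`, since it accepts any iterable of lines.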

Paper Structure (35 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: The file structure of ELCC.
  • Figure 2: XferBench score across ELCC and human language baselines; lower is better. "No pretrain" baseline illustrated with the line on the plot.
  • Figure 3: Sample utterances from the best and worst performing emergent language corpora on XferBench from ELCC.
  • Figure 4:
  • Figure 5: XferBench scores compared to expected order; lower is better.
  • ...and 1 more figure