Morpheme Induction for Emergent Language

Brendon Boldt, David Mortensen

TL;DR

CSAR is a greedy morpheme-induction algorithm for emergent languages. It jointly segments forms and aligns them with meanings by iteratively selecting the form–meaning pair with the highest mutual-information weight and ablating it from the corpus. The method is validated on procedurally generated data, human-language tasks, and emergent-language settings; it performs strongly at recovering full form–meaning inventories and yields insights into synonymy, polysemy, and compositionality. An open-source Python implementation supports broad applicability, and the analysis demonstrates CSAR’s potential to enable morphosyntactic investigations of emergent languages. Limitations include the risk of local optima due to greediness and the limited breadth of available data, motivating future work on non-greedy approaches and larger emergent-language datasets.

Abstract

We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
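
The Count, Select, Ablate, Repeat loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`induce_morphemes`, `ngrams`, `remove_first`), the restriction of candidate forms to contiguous n-grams and candidate meanings to single atoms, and the weight (joint probability times pointwise mutual information, i.e. one term of the mutual-information sum) are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Return the set of contiguous n-grams (as tuples) up to length max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def remove_first(tokens, gram):
    """Return tokens with the first occurrence of gram removed, if present."""
    n = len(gram)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == gram:
            return tokens[:i] + tokens[i + n:]
    return tokens


def induce_morphemes(corpus, max_n=2, max_morphemes=50):
    """Greedy Count-Select-Ablate-Repeat loop (illustrative sketch only)."""
    records = [(list(form), set(meaning)) for form, meaning in corpus]
    total = len(records)
    inventory = []
    for _ in range(max_morphemes):
        # Count: occurrences of each form, each meaning atom, and each pair.
        form_counts, meaning_counts, joint_counts = Counter(), Counter(), Counter()
        for form, meaning in records:
            grams = ngrams(form, max_n)
            form_counts.update(grams)
            meaning_counts.update(meaning)
            joint_counts.update((g, m) for g in grams for m in meaning)
        if not joint_counts:
            break

        # Select: assumed weight is p(f, m) * log(p(f, m) / (p(f) * p(m))).
        def weight(pair):
            g, m = pair
            p_joint = joint_counts[pair] / total
            pmi = math.log(p_joint * total * total
                           / (form_counts[g] * meaning_counts[m]))
            return p_joint * pmi

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(joint_counts), key=weight)
        if weight(best) <= 0:
            break
        inventory.append(best)

        # Ablate: strip the chosen form and meaning from records holding both.
        g, m = best
        ablated = []
        for form, meaning in records:
            if m in meaning and g in ngrams(form, max_n):
                form, meaning = remove_first(form, g), meaning - {m}
            ablated.append((form, meaning))
        records = ablated
    return inventory


# Toy compositional "language": first token marks color, second marks shape.
corpus = [
    (["ba", "ku"], {"red", "circle"}),
    (["ba", "mi"], {"red", "square"}),
    (["zo", "ku"], {"blue", "circle"}),
    (["zo", "mi"], {"blue", "square"}),
]
inventory = induce_morphemes(corpus)
print(inventory)
```

On this toy corpus the sketch pairs each token with the attribute value it consistently co-occurs with. Note that removing a form and rejoining the remaining tokens can create n-grams that never occurred in the original utterance; a fuller implementation would need to account for this.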

Paper Structure

This paper contains 63 sections, 6 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Example of morphemes extracted from a signalling game with pixel observations.
  • Figure 2: Fuzzy $F_1$ scores for CSAR and baseline methods across procedural datasets. Results are reported for form–meaning inventories and form-only inventories.
  • Figure 3: Examples of morphemes induced from various human language datasets and tasks.
  • Figure 4: Exact $F_1$ scores of baseline methods on the procedural datasets.