Morpheme Induction for Emergent Language
Brendon Boldt, David Mortensen
TL;DR
CSAR is a greedy morpheme-induction algorithm for emergent languages. It jointly segments forms and aligns them with meanings by repeatedly selecting the form–meaning pair with the highest mutual-information weight and ablating that pair from the corpus. The method is validated on procedurally generated data, human-language tasks, and emergent-language corpora, where it performs strongly at recovering full form–meaning inventories and provides insight into synonymy, polysemy, and compositionality. An open-source Python implementation supports broad applicability, and the analysis demonstrates CSAR’s potential to enable morphosyntactic investigations of emergent languages. Limitations include susceptibility to local optima, a consequence of the greedy search, and the limited breadth of the data examined; both motivate future work on non-greedy approaches and larger emergent-language datasets.
Abstract
We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (hence the name: Count, Select, Ablate, Repeat). First, we validate the effectiveness of CSAR on procedurally generated datasets and compare it against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
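The four steps can be illustrated with a minimal Python sketch. This is not the authors' released implementation, and it makes several simplifying assumptions of its own: forms are lists of tokens, meanings are sets of atomic labels, candidate forms are contiguous n-grams, candidate meanings are single atoms, and the weight is only the joint-occurrence term of mutual information. The names `induce_morphemes`, `ngrams`, `find`, `max_len`, and `max_morphemes` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_len):
    """Yield contiguous n-grams (as tuples), skipping ablated (None) slots."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if None not in gram:
                yield gram


def find(tokens, form):
    """Return the start index of the first occurrence of form, or -1."""
    n = len(form)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == form:
            return i
    return -1


def induce_morphemes(corpus, max_len=2, max_morphemes=50):
    """Greedy Count-Select-Ablate-Repeat loop over (form, meaning) records.

    corpus: iterable of (list of tokens, set of meaning atoms).
    Returns a list of ((form_ngram, meaning_atom), weight) in selection order.
    """
    records = [(list(form), set(meaning)) for form, meaning in corpus]
    n_records = len(records)
    inventory = []
    for _ in range(max_morphemes):
        # Count: per-record occurrence of forms, meanings, and their pairs.
        form_c, mean_c, joint_c = Counter(), Counter(), Counter()
        for form, meaning in records:
            forms = set(ngrams(form, max_len))
            form_c.update(forms)
            mean_c.update(meaning)
            joint_c.update((f, m) for f in forms for m in meaning)

        # Select: highest positive weight. The weight is an assumed stand-in:
        # p(f, m) * log(p(f, m) / (p(f) * p(m))), one term of mutual information.
        best, best_w = None, 0.0
        for (f, m), c in sorted(joint_c.items()):  # sorted for stable ties
            w = (c / n_records) * math.log(c * n_records / (form_c[f] * mean_c[m]))
            if w > best_w:
                best, best_w = (f, m), w
        if best is None:
            break  # no positively associated pair is left
        inventory.append((best, best_w))

        # Ablate: blank out the form and drop the meaning where both occur.
        f, m = best
        for form, meaning in records:
            if m in meaning:
                i = find(form, f)
                if i >= 0:
                    form[i:i + len(f)] = [None] * len(f)
                    meaning.discard(m)
        # Repeat: the next pass recounts on the ablated corpus.
    return inventory


if __name__ == "__main__":
    toy = [
        (["a", "c"], {"red", "circle"}),
        (["a", "d"], {"red", "square"}),
        (["b", "c"], {"blue", "circle"}),
        (["b", "d"], {"blue", "square"}),
    ]
    for (form, meaning), weight in induce_morphemes(toy):
        print(form, meaning, round(weight, 3))
```

Ablated tokens are replaced with `None` rather than deleted, so that tokens on either side of a removed morpheme are not treated as adjacent in later passes. Because each selection is final, an early poor choice cannot be undone, which is the local-optimum risk inherent in a greedy procedure.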
