Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
Arne Rubehn, Christoph Rzymski, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Katja Bocklage, David Snee, Abishek Stephen, Johann-Mattis List
TL;DR
The paper tackles the challenge of comparing numeral systems across languages by standardizing their annotation and inferring their compositional structure. It introduces a computer-assisted workflow, an extended CLDF-based representation, and an evaluation framework for unsupervised morpheme segmentation on a cross-linguistic sample of 25 languages. The results identify allomorphy as the main source of segmentation errors, show that Morfessor generally outperforms simpler models, and demonstrate that subword tokenization is not effective for morpheme discovery in low-resource numeral data. The work provides a reproducible pipeline and dataset to support typology and NLP research on multilingual morphology, with implications for diachronic analysis and cross-linguistic learning.
Abstract
Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved into their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find that allomorphy is the major cause of segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
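To make the idea of scoring automated morpheme segmentation against a gold annotation more concrete, the following is a minimal sketch of boundary-based precision, recall, and F1. It is an illustration only and not necessarily the evaluation used in the paper: the `+` separator, the function names `boundaries` and `boundary_prf`, and the predicted segmentations are all assumptions made for this example. The English forms are chosen because they show allomorphy (`thir` for "three", `fif` for "five"), the kind of alternation the abstract names as the main source of errors.

```python
def boundaries(segmented, sep="+"):
    """Return the character offsets at which a segmented form is split."""
    positions = set()
    offset = 0
    for morph in segmented.split(sep)[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_prf(gold, predicted, sep="+"):
    """Compute micro-averaged boundary precision, recall, and F1."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        # Both segmentations must describe the same surface form.
        if g.replace(sep, "") != p.replace(sep, ""):
            raise ValueError(f"forms differ: {g!r} vs. {p!r}")
        gold_b, pred_b = boundaries(g, sep), boundaries(p, sep)
        tp += len(gold_b & pred_b)   # boundaries found correctly
        fp += len(pred_b - gold_b)   # boundaries that should not be there
        fn += len(gold_b - pred_b)   # boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Gold segmentations of three English numerals (allomorphs: thir-, fif-).
gold = ["thir+teen", "four+teen", "fif+teen"]
# Hypothetical output of an unsupervised segmenter (invented for illustration).
predicted = ["thirteen", "four+teen", "fi+fteen"]

precision, recall, f1 = boundary_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the only correctly recovered boundary is in the form without allomorphy, which mirrors the kind of error pattern the abstract describes.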
