Complexity counts: global and local perspectives on Indo-Aryan numeral systems
Chundra Cathcart
TL;DR
The study addresses the extreme morphophonological irregularity of Indo-Aryan (IA) numeral systems by situating them within cross-linguistic numeral typology and quantifying their complexity with four metrics: minimum description length (MDL), $N$-gram surprisal, and two Linear Discriminative Learning (LDL) tasks (production and comprehension). Using the UniNum and SAND datasets, it demonstrates that South Asia, and Indo-Aryan in particular, exhibits unusually high numeral-system complexity, though certain vigesimal IA varieties show reduced complexity and the historical drivers remain unclear. A key finding is that, despite their high overall complexity, IA numerals still adhere to general pressures toward communicative efficiency: complexity gradually decreases as cardinality increases, and higher-frequency forms are relatively easier to produce and classify. The paper argues for incorporating integrative complexity into cross-linguistic analyses of numeral systems and outlines future directions, including psycholinguistic experiments and richer historical modeling, to understand the persistence and evolution of IA numeral complexity and its interaction with broader communicative pressures.
Abstract
The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual: unlike most numeral systems (e.g., those of English and Chinese), forms referring to 1--99 are highly non-transparent and cannot be constructed using straightforward rules. As an example, Hindi/Urdu *ikyānve* `91' is not decomposable into the composite elements *ek* `one' and *nave* `ninety' in the way that its English counterpart is. This paper situates Indo-Aryan languages within the typology of cross-linguistic numeral systems, and explores the linguistic and non-linguistic factors that may be responsible for the persistence of complex systems in these languages. Using cross-linguistic data from multiple databases, we develop and employ a number of cross-linguistically applicable metrics to quantify the complexity of languages' numeral systems, and demonstrate that Indo-Aryan languages have decisively more complex numeral systems than the world's languages as a whole, though individual Indo-Aryan languages differ from each other in the complexity of the patterns they display. We investigate the factors (e.g., religion and geographic isolation) that underlie complexity in numeral systems, with a focus on South Asia, in an attempt to develop an account of why complex numeral systems developed and persisted in certain Indo-Aryan languages but not elsewhere. Finally, we demonstrate that Indo-Aryan numeral systems adhere to certain general pressures toward efficient communication found cross-linguistically, despite their high complexity. We call for this somewhat overlooked dimension of complexity to be taken seriously when discussing general variation in cross-linguistic numeral systems.
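To make the idea of a surprisal-based complexity metric concrete, the following is a minimal sketch of character-bigram surprisal over numeral forms. It is an illustration under stated assumptions, not the implementation used in this paper: the function names `train_bigram` and `mean_surprisal`, the add-one smoothing, the bigram order, and the toy English training forms are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def train_bigram(forms):
    """Count character bigrams (with '#' as a word boundary) over word forms."""
    bigrams, contexts = Counter(), Counter()
    vocab = {"#"}
    for form in forms:
        padded = "#" + form + "#"
        vocab.update(form)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def mean_surprisal(form, model):
    """Mean per-symbol surprisal in bits, using add-one smoothing.

    One extra slot is reserved in the denominator for unseen symbols
    (an assumption of this sketch, chosen for simplicity).
    """
    bigrams, contexts, vocab = model
    n_outcomes = len(vocab) + 1
    padded = "#" + form + "#"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for prev, cur in pairs:
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + n_outcomes)
        total += -math.log2(p)
    return total / len(pairs)

# Toy training set: transparent English forms (illustrative only).
training_forms = [
    "twenty", "twenty-one", "twenty-two",
    "thirty", "thirty-one", "thirty-two",
    "forty", "forty-one",
]
model = train_bigram(training_forms)

# A form built from familiar parts versus the same symbols in reverse order.
regular = mean_surprisal("forty-two", model)
scrambled = mean_surprisal("owt-ytrof", model)
print(f"regular: {regular:.2f} bits/symbol, scrambled: {scrambled:.2f} bits/symbol")
```

In this toy setting, a held-out form assembled from recurring pieces receives lower mean surprisal than a string with the same symbols in an unfamiliar order. The intuition carried over to numeral systems is that forms which cannot be predicted from the rest of the system contribute more to a surprisal-based complexity score.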
