Recursive numeral systems are highly regular and easy to process

Ponrawee Prasertsom, Andrea Silvi, Jennifer Culbertson, Moa Johansson, Devdatt Dubhashi, Kenny Smith

TL;DR

The paper reframes the efficiency of recursive numeral systems by foregrounding regularity and processing complexity, arguing that MDL-based measures of these properties separate natural from unattested systems better than the prior trade-off between lexicon size and morphosyntactic cost. It measures irregularity via the complexity of a minimal partial DFA and processing cost via MDL parsing, and applies both measures to natural languages, random baselines, and systems identified as optimal in prior work. The results show that natural numeral systems are markedly more regular and easier to process, remaining near local Pareto frontiers under multiple controls and priors. The study highlights regularity as a key driver of human-like numeral systems and proposes methodological extensions to generalize efficiency analyses to broader, formation-based linguistic domains.

Abstract

Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntactic complexity (Denić and Szymanik, 2024). However, showing that only natural-language-like systems optimise this trade-off has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including "optimal" recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: The minimal partial DFA representing the numerals of Karo Batak (a highly regular system) in the range 1 to 99. Circles are states, double circles are accepting states, and arrows are transitions. $\lambda$ denotes the initial state. A given numeral is parsed or generated by traversing the automaton and emitting the symbols (morphemes) associated with the transitions. Accepting states mark valid potential termination points of parsing or generation. Each state is named after the rule that generates the (partial) numerals read when that state is reached. For example, the number 10 in Karo Batak, expressed by $1*10$, corresponds to the path from $\lambda$ through the transition for $1$ to the state $D$, then the transition for $*$ to the state $D*$, then the transition for $10$, ending at the state $D*10$.
  • Figure 2: Irregularity (x-axis) and processing complexity (y-axis) of natural languages (blue) and 10,000 randomly generated baseline artificial languages (orange).
  • Figure 3: Irregularity (x-axis) and processing complexity (y-axis) of natural languages (blue, square) and optimal languages in D&S (red, circle) and Y&R (green, triangle).
  • Figure 4: Irregularity (x-axis) and processing complexity (y-axis) of the most (purple) and least (pink) efficient natural-$(D, M, C, L_{Num})$-matched artificial systems, as well as their corresponding natural systems (blue). Natural systems consistently lie on the approximate local frontier (cross). The leftmost panel shows a case where the space is very small, with one best system (occupied by natural systems like Karo Batak) and one worst system. Subsequent panels (Sierra Nahuatl-, Drehu-, Huave- and French-like systems) show cases where there is room for variation, though $L(N\mid G)$ varies to different degrees depending on the exact $(D, M, C, L_{Num})$. French-like systems exhibit little variation in $L(N\mid G)$, whereas Nahuatl-like systems exhibit more.
  • Figure 5: Irregularity (x-axis) and unweighted processing complexity (y-axis) of natural systems (blue, square) and optimal systems in D&S (red, circle) and Y&R (green, triangle).
  • ...and 1 more figure
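
The traversal described in the Figure 1 caption can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation and not the full Karo Batak automaton. The path $\lambda \to D \to D* \to D*10$ is taken from the caption. The `+` symbol, the states `D*10+` and `D*10+D`, the choice of accepting states, and the names `build_toy_dfa` and `trace` are assumptions introduced here for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

START = "λ"  # initial state, as named in the Figure 1 caption
DIGITS = [str(d) for d in range(1, 10)]

Transitions = Dict[Tuple[str, str], str]


def build_toy_dfa() -> Tuple[Transitions, Set[str]]:
    """Build a toy partial DFA over morpheme symbols.

    Only the path λ -> D -> D* -> D*10 comes from the caption; the
    '+' continuation and the accepting set are hypothetical.
    """
    transitions: Transitions = {}
    for d in DIGITS:
        transitions[(START, d)] = "D"          # a digit morpheme leads to D
        transitions[("D*10+", d)] = "D*10+D"   # hypothetical: tens plus a unit
    transitions[("D", "*")] = "D*"
    transitions[("D*", "10")] = "D*10"
    transitions[("D*10", "+")] = "D*10+"       # hypothetical continuation
    accepting = {"D", "D*10", "D*10+D"}        # hypothetical accepting states
    return transitions, accepting


def trace(symbols: List[str], transitions: Transitions,
          accepting: Set[str]) -> Optional[List[str]]:
    """Return the visited states if the symbol sequence is accepted, else None."""
    state = START
    path = [state]
    for symbol in symbols:
        nxt = transitions.get((state, symbol))
        if nxt is None:
            # Partial DFA: a missing transition means the form is rejected.
            return None
        state = nxt
        path.append(state)
    return path if state in accepting else None


transitions, accepting = build_toy_dfa()
print(trace(["1", "*", "10"], transitions, accepting))
```

Because the automaton is partial, no explicit sink state is needed: any symbol without an outgoing transition from the current state rejects the form, as does ending in a non-accepting state such as `D*`.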