Confounding Factors in Relating Model Performance to Morphology

Wessel Poelman, Thomas Bauwens, Miryam de Lhoneux

TL;DR

The paper argues that assessing how morphology affects language modeling is hindered by confounding factors in experimental design. It critiques three hypotheses explaining higher perplexities for agglutinative languages and demonstrates that relying on stem-suffix alignment, tokenization efficiency, or data size alone is insufficient. It introduces two gradient, token-based proxies, accessor variety ($AV$) and entropic efficiency ($\eta$), which are computed on token bigrams to predict LM difficulty intrinsically, without expert morphology annotations. The authors advocate for principled experimental setups and show that a gradient view of morphology better explains cross-language LM behavior than coarse morpho-typological groupings, with practical implications for evaluating multilingual models and tokenizers.

Abstract

The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.

Paper Structure

This paper contains 51 sections, 33 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Computation of our tokenizer-based gradient proxies of morphology (\ref{sec:metrics}) for the right accessors of a Finnish subword _kirj: accessor variety (AV), total accessors (TA), uniqueness (AU), and entropic efficiency ($\eta$). The metrics are computed in fixed-size windows for each subword in the vocabulary, to mimic MATTR (Covington & McFall, 2010). Our metrics better capture the relation between morphology and tokenization compared to word-based or unigram evaluation metrics.
  • Figure 2: Metrics across EuroParl and FineWeb. The bigram metrics are calculated within pretokens; \ref{fig:additional-sorting-results} contains results without pretokens. Similar to \ref{tab:europarl}, we sort by $\eta$ to show how (dis)similar the metrics are to existing metrics. EuroParl (EP) contains 21 languages; FineWeb (FW) 63. FW $\cap$ EP $=$ 19; FW $\cup$ EP $=$ 65. Full results in table form are in \ref{apx:results}. The added coarse groupings are Isolating languages, which tend to have little to no inflection and use few morphemes per word, and Introflective (or non-concatenative) languages, which modify roots and tend to use few to no morphemes.
  • Figure 3: One-sided hypothesis test for a significant reduction in an initially positive difference. Everything left of $\Delta_\alpha$ is significant.
  • Figure 4: The two experimental setups discussed in \ref{apx:hypothesesdiff}. Each box is a statistic. Each double arrow is a hypothesis test. The red boxes mark the values that, as discussed in the text, should produce a significant hypothesis test (i.e. reject $H_0$) when the treatment is effective.
  • Figure 5: Distribution of a $T$-test statistic under the null hypothesis (e.g. "no gap; treatment worked") and the alternative hypothesis (e.g. "gap; treatment did not work"). The purple line indicates the hypothesis threshold $t_{\alpha,\nu}$ under $H_0$. The blue line indicates a value of $T$ which is not significant (it is to the left of $t_{\alpha,\nu}$, or equivalently, its $p$-value under the null hypothesis, the dark blue area, is bigger than $\alpha$), yet it would be more likely under $H_1$ (the dark red area is bigger than the dark blue area) despite causing $H_1$ to be rejected.
  • ...and 4 more figures
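
To make the right-accessor quantities named in the Figure 1 caption more concrete, the following is a minimal Python sketch for a single subword over a flat token sequence. The caption only names the metrics, so the formulas here are assumptions rather than the authors' definitions: AV is taken as the number of distinct right neighbours, TA as the total number of right-neighbour occurrences, AU as AV / TA, and $\eta$ as the Shannon entropy of the accessor distribution divided by its maximum, $\log_2 AV$. The function name `right_accessor_metrics` is hypothetical, and the sketch leaves out both the fixed-size windowing and the restriction to pretokens that the captions describe.

```python
import math
from collections import Counter


def right_accessor_metrics(tokens, target):
    """Right-accessor statistics for `target` in a flat list of tokens.

    Assumed definitions (not confirmed by the source):
      AV  = number of distinct tokens that directly follow `target`
      TA  = total number of tokens that directly follow `target`
      AU  = AV / TA
      eta = entropy of the accessor distribution / log2(AV)
    """
    # Count every token that appears immediately to the right of `target`.
    accessors = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target
    )
    ta = sum(accessors.values())
    av = len(accessors)
    if ta == 0:
        # `target` never occurs with a right neighbour.
        return {"AV": 0, "TA": 0, "AU": 0.0, "eta": 0.0}

    # Shannon entropy (in bits) of the empirical accessor distribution.
    entropy = -sum((c / ta) * math.log2(c / ta) for c in accessors.values())
    # With a single accessor type the maximum entropy is 0; report 0.0
    # here to avoid dividing by zero (a choice made for this sketch).
    eta = entropy / math.log2(av) if av > 1 else 0.0
    return {"AV": av, "TA": ta, "AU": av / ta, "eta": eta}


# Toy token stream built around the Finnish subword from Figure 1.
toy = ["_kirj", "a", "_kirj", "e", "_kirj", "a", "_kirj", "oitti"]
print(right_accessor_metrics(toy, "_kirj"))
```

In the toy stream, `_kirj` is followed by `a` twice and by `e` and `oitti` once each, so under the assumed definitions AV is 3, TA is 4, and AU is 0.75. A windowed variant in the spirit of MATTR would presumably average these values over fixed-size slices of the token stream instead of computing them once over the whole sequence.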