Vocabulary shapes cross-lingual variation of word-order learnability in language models

Jonas Mayer Martins; Jaap Jumelet; Viola Priesemann; Lisa Beinborn

Abstract

Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.

Paper Structure (48 sections, 10 equations, 11 figures, 3 tables)

Introduction
Research questions
Synthetic languages
Approach and contributions
Language learnability
Language variation
Computational learnability
Methodology
Synthetic word order
Deterministic shuffling
Our approach
Formal model
Implementation
Experimental setup
Data
...and 33 more sections

Figures (11)

Figure 1: We create a spectrum of synthetic language variants by deterministically permuting words within each sentence. For each sentence length, a permutation is sampled from the Mallows permutation model, where the order parameter $\theta$ controls preference for the original word order. As an example, we show the probability distribution of a word originally at position 16 in a $20$-word sentence.
Figure 2: (a) Surprisal change $\Delta S$ due to word-order perturbations with order $\theta$ for each language (named in panel b). Line color and style encode word order: fixed as solid red and free as dashed blue. (b) Zoom-in of surprisal change $\Delta S_\mathrm{irreg}$ at irregular order $\theta = 0$ against the original surprisal $S_\mathrm{orig}$. Red and blue markers are projections onto the axes that indicate a separation of free- and fixed-word-order languages in $S_\mathrm{orig}$ but an overlap in $\Delta S_\mathrm{irreg}$. Transparent bands in panel (a) and error bars in panel (b) show the 25th to 75th percentile over seeds; the lines in (a) and the points in (b) show the median over seeds.
Figure 3: Surprisal difference between word- and subword-level shuffling at irregular word order, with median seed and interquartile ranges.
Figure 4: Percentage of (a) words and (b) subwords in the corpus accounted for by the most frequent vocabulary items. This coverage increases more slowly for languages with freer word order compared to languages with relatively fixed word order (shades of blue and red, respectively).
Figure 5: The absolute surprisal $S(\theta) = S_\mathrm{orig} + \Delta S(\theta)$ per language modeled through a set of vocabulary statistics, encompassing coverage, sentence length, and proxies for morphological complexity. The predictions are cross-validated through leave-one-language-out: Each language is predicted solely on the basis of its own vocabulary statistics by a model trained on the surprisal of the other languages and their predictors.
...and 6 more figures
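
The permutation scheme described in the caption of Figure 1 can be illustrated with a minimal sketch. It assumes the standard Mallows model centered on the original order, $P(\pi) \propto \exp(-\theta \, d(\pi, \mathrm{id}))$ with $d$ the Kendall tau distance, and samples from it with the repeated-insertion method. The function names, the seeding scheme, and the choice of sampler are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def sample_mallows(n, theta, rng):
    """Sample a permutation of range(n) from a Mallows model centered on the identity.

    Assumption: Kendall tau distance, sampled by repeated insertion.
    theta = 0 gives a uniform permutation; large theta favors the original order.
    """
    perm = []
    for i in range(n):
        # Inserting item i at index j (0..i) creates exactly i - j inversions,
        # so the insertion index is weighted by exp(-theta * (i - j)).
        weights = [math.exp(-theta * (i - j)) for j in range(i + 1)]
        j = rng.choices(range(i + 1), weights=weights)[0]
        perm.insert(j, i)
    return perm


def permutation_for_length(n, theta, seed=0):
    """One fixed permutation per sentence length (hypothetical seeding scheme)."""
    rng = random.Random(seed * 100_003 + n)
    return sample_mallows(n, theta, rng)


def shuffle_sentence(words, theta, seed=0):
    """Deterministically reorder a tokenized sentence using the permutation for its length."""
    perm = permutation_for_length(len(words), theta, seed)
    return [words[k] for k in perm]


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    for theta in (0.0, 0.5, 2.0):
        print(theta, " ".join(shuffle_sentence(sentence, theta)))
```

Because the permutation depends only on the sentence length, $\theta$, and the seed, every sentence of a given length is reordered in the same way, which matches the deterministic shuffling the caption describes.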
