Testing the Predictions of Surprisal Theory in 11 Languages

Ethan Gotlieb Wilcox; Tiago Pimentel; Clara Meister; Ryan Cotterell; Roger P. Levy

TL;DR

The paper tests Surprisal Theory beyond English by analyzing reading times in eleven languages, using the MECO corpus and predictors derived from autoregressive language models. It evaluates three claims: (i) surprisal predicts reading times, (ii) contextual entropy predicts reading times, and (iii) the link between surprisal and reading times is linear. Using regression-based delta log-likelihood comparisons and GAM visualizations, the authors show robust crosslinguistic evidence for surprisal, additional predictive power from contextual entropy, and a nearly linear mapping between surprisal and reading times across diverse languages. The work provides the strongest crosslinguistic support to date for information-theoretic accounts of incremental language processing and informs multilingual modeling of psycholinguistic behavior.
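The delta log-likelihood comparison mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the baseline is reduced to an intercept-only model for brevity (the figure captions indicate the paper's regressions also include predictors such as frequency and length), and all function names are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residual_sd(y, pred):
    """Maximum-likelihood estimate of the Gaussian noise scale."""
    return math.sqrt(sum((yi - p) ** 2 for yi, p in zip(y, pred)) / len(y))

def mean_loglik(y, pred, sd):
    """Mean per-word Gaussian log-likelihood of y given predictions."""
    const = -0.5 * math.log(2 * math.pi * sd ** 2)
    return sum(const - (yi - p) ** 2 / (2 * sd ** 2)
               for yi, p in zip(y, pred)) / len(y)

def delta_loglik(s_train, rt_train, s_test, rt_test):
    """Held-out log-likelihood gain of the surprisal model over the baseline."""
    # Baseline (toy assumption): intercept only, predict the mean training RT.
    mean_rt = sum(rt_train) / len(rt_train)
    base_sd = residual_sd(rt_train, [mean_rt] * len(rt_train))
    base = mean_loglik(rt_test, [mean_rt] * len(rt_test), base_sd)
    # Target: the baseline plus a linear surprisal term.
    a, b = fit_line(s_train, rt_train)
    tgt_sd = residual_sd(rt_train, [a + b * s for s in s_train])
    target = mean_loglik(rt_test, [a + b * s for s in s_test], tgt_sd)
    return target - base

# Synthetic data (not from MECO): reading time grows linearly with surprisal.
random.seed(0)
surprisals = [random.uniform(0, 15) for _ in range(400)]
rts = [200 + 10 * s + random.gauss(0, 20) for s in surprisals]
delta = delta_loglik(surprisals[:300], rts[:300], surprisals[300:], rts[300:])
print(delta)
```

A positive value means the model with surprisal assigns higher likelihood to held-out reading times than the baseline does, which is the sense in which surprisal "contributes" predictive power.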

Abstract

A fundamental result in psycholinguistics is that less predictable words take a longer time to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e. its negative log-probability given a context. While evidence supporting the predictions of Surprisal Theory has been replicated widely, most studies have focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with Surprisal Theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e. contextual entropy, is predictive of reading times; and (iii) whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to date between information theory and incremental language processing across languages.
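As a small illustration of the two quantities defined above, the following sketch computes surprisal and contextual entropy from a next-word distribution. The distribution is a made-up toy example, the function names are hypothetical, and the choice of base-2 logarithms (bits) is an assumption; the abstract does not specify a unit.

```python
import math

def surprisal(p_next, word):
    """Surprisal of `word` in bits: -log2 p(word | context)."""
    return -math.log2(p_next[word])

def contextual_entropy(p_next):
    """Expected surprisal over the next-word distribution, in bits."""
    return sum(-p * math.log2(p) for p in p_next.values() if p > 0)

# Hypothetical next-word distribution for some context (toy numbers).
p_next = {"dog": 0.5, "cat": 0.25, "bird": 0.25}

print(surprisal(p_next, "cat"))    # 2.0
print(contextual_entropy(p_next))  # 1.5
```

Surprisal depends on the word that actually occurs, whereas contextual entropy depends only on the context, so it can be computed before the word is observed.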

Paper Structure (32 sections, 4 equations, 7 figures, 1 table)

Introduction
Psycholinguistic Predictive Power
Surprisal
Contextual Entropy
Experimental Setup
Dataset
Language Models
Monolingual Models
Multilingual Model
Context Length
Psychological Plausibility
Regression Models
Results
Surprisal
Contextual Entropy
...and 17 more sections

Figures (7)

Figure 1: Predictive Power of Surprisal Across Languages: Positive values mean surprisal contributes to predicting the reading times over a baseline where surprisal is removed. Error bars indicate 95% confidence intervals. Stars indicate the significance of a paired permutation test. We find a consistent significant effect of surprisal across languages for language models that are both multilingual (top row) and monolingual (bottom two rows), and for both progressive gaze duration and total fixation.
Figure 2: Psychometric Predictive Power of Contextual Entropy Across Languages: Positive values mean contextual entropy contributes to predicting the reading times of $w_t$. Error bars are 95% confidence intervals across the 10 folds of held-out data. Stars indicate the significance of a paired permutation test. We find that replacing surprisal with entropy tends to hurt predictive power, while adding entropy tends to help.
Figure 3: Model Coefficients: Coefficients for a linear model that includes surprisal, entropy, frequency and length. Coefficients are shown for each regressor word individually. Zero is indicated with a black line and scales differ for each row. Error bars indicate 95% CIs across folds of data.
Figure 4: Test Perplexity versus $\Delta$ (mGPT): We do not find a significant correlation between the $\Delta$ and mGPT's perplexity for a language or language family.
Figure 5: Surprisal versus Reading Time Relationship: Non-linear GAMs are in green while linear control GAMs are in dotted blue. Shaded regions represent bootstrapped 95% confidence intervals. Results are for gaze duration. Grey subplots indicate the distribution of surprisal values. We find that GAMs recover a linear relationship between surprisal and reading-time slowdown.
...and 2 more figures
