
Computational Sentence-level Metrics Predicting Human Sentence Comprehension

Kun Sun, Rong Wang

TL;DR

This work addresses the gap in modeling sentence-level human comprehension by introducing two metrics—sentence surprisal and sentence relevance—computed with multilingual LLMs. Sentence surprisal uses next-sentence probability or chain-rule-based probabilities, while sentence relevance uses an attention-inspired, memory-weighted semantic similarity across surrounding sentences. Evaluated on the Multilingual Eye-tracking Corpus (MECO) with 13 languages, the metrics predict sentence reading speed via Generalized Additive Mixed Models, with combined surprisal and relevance yielding the strongest cross-linguistic predictive power ($\Delta \text{AIC}$ substantially negative). The findings show that sentence-level metrics generalize across languages and offer interpretable insight into discourse-level processing, supporting closer integration of LLM-based representations with cognitive language processing research. These results have potential to inform cross-linguistic models of reading and to enhance NLP systems with discourse-aware, cognitively plausible metrics.
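The chain-rule variant of sentence surprisal can be sketched as follows. It applies $-\log P(S \mid C) = -\sum_i \log P(w_i \mid w_{<i}, C)$. The sketch assumes an autoregressive multilingual LLM has already returned the natural-log probability of each token in the sentence, conditioned on the preceding context $C$ and on the earlier tokens of the same sentence. The function names `sentence_surprisal` and `mean_token_surprisal` are hypothetical. The token probabilities in the usage lines are made-up illustrative values, not model output.

```python
import math
from typing import Sequence


def sentence_surprisal(token_logprobs: Sequence[float], base: float = 2.0) -> float:
    """Chain-rule sentence surprisal, -log P(S | C), in units of `base`.

    Each entry of `token_logprobs` is ln P(w_i | w_<i, C): the natural-log
    probability of token i given the context C and the earlier tokens.
    Summing them applies the chain rule, so the sum equals ln P(S | C).
    """
    if not token_logprobs:
        raise ValueError("sentence must contain at least one token")
    total_nats = -sum(token_logprobs)
    return total_nats / math.log(base)  # change of base: nats -> bits by default


def mean_token_surprisal(token_logprobs: Sequence[float], base: float = 2.0) -> float:
    """Length-normalized variant: average surprisal per token (one common option)."""
    return sentence_surprisal(token_logprobs, base) / len(token_logprobs)


# Illustrative (made-up) token probabilities 0.5, 0.25, 0.5.
example = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(sentence_surprisal(example))    # approximately 4.0 bits (1 + 2 + 1)
print(mean_token_surprisal(example))  # approximately 1.333 bits per token
```

This section does not state whether the paper normalizes sentence surprisal by sentence length. The per-token average is included only as one common option.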

Abstract

The majority of research in computational psycholinguistics has concentrated on the processing of words. This study introduces new methods for computing sentence-level metrics using multilingual large language models. Two metrics, sentence surprisal and sentence relevance, are developed and then tested and compared to determine whether they can predict how humans comprehend sentences as a whole across languages. These metrics are highly interpretable and predict human sentence reading speeds with high accuracy. Our results indicate that these computational sentence-level metrics are highly effective at predicting and explaining the processing difficulties readers encounter when comprehending sentences as a whole, across a variety of languages. Their strong performance and ability to generalize across languages offer a promising avenue for future research integrating LLMs and cognitive science.
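Sentence relevance is described in the TL;DR as an attention-inspired, memory-weighted semantic similarity across surrounding sentences. It can be sketched as a weighted average of cosine similarities between a target sentence embedding and the embeddings of its neighbors. This minimal sketch rests on several assumptions. The embeddings are taken as given; in practice they would come from a multilingual LLM. The offset-to-weight mapping `weights` is a hypothetical stand-in for the memory weights shown in Figure 4, since the actual values are not given in this section. The function names are also hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def sentence_relevance(
    embeddings: Sequence[Sequence[float]],
    idx: int,
    weights: Dict[int, float],
) -> float:
    """Memory-weighted relevance of sentence `idx` to its neighbors.

    `weights` maps a relative offset (-1 = previous sentence, +1 = next
    sentence, ...) to a hypothetical memory weight. The result is the
    weighted average of cosine similarities over neighbors that exist.
    """
    num, den = 0.0, 0.0
    for offset, w in weights.items():
        j = idx + offset
        if offset == 0 or not 0 <= j < len(embeddings):
            continue  # skip the sentence itself and out-of-range neighbors
        num += w * cosine(embeddings[idx], embeddings[j])
        den += w
    return num / den if den > 0 else 0.0


# Toy 2-d "embeddings" (hypothetical): sentence 1 matches sentence 0, not sentence 2.
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical weights: recent past weighted most, next sentence weighted less.
w = {-2: 0.5, -1: 1.0, 1: 0.5}
print(sentence_relevance(emb, 1, w))  # 0.666... (= 1.0 / 1.5)
```

A positive offset lets the following sentence contribute. This loosely mirrors the parafoveal preview idea in Figure 3, but whether and how the paper weights upcoming sentences is an assumption here.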

Paper Structure (13 sections, 5 equations, 6 figures, 2 tables)

This paper contains 13 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The computational methods in the present study
  • Figure 2: The overall partial effects of the primary predictors (sentence surprisal and sentence relevance) on reading speed across languages. The x-axis denotes the metric, and the y-axis represents reading speed. Sentence surprisal and sentence relevance are log-transformed to bring their distributions closer to normal and thereby improve model fit. Each curve depicts the relationship between a predictor variable and the response variable, reading speed. Steeper slopes indicate a stronger relationship between the predictor and reading speed, while flatter slopes suggest a weaker effect.
  • Figure 3: The parafoveal-on-foveal effects in reading (from Masato2023polychromy)
  • Figure 4: The memory capability and weights adopted in the attention-aware approach
  • Figure 5: The partial effects of the primary predictors (sentence surprisal and sentence relevance) on reading speed across 13 languages (Dutch, English, Estonian, Finnish, German, Greek, Hebrew, Italian, Korean, Norwegian, Russian, Spanish, Turkish). Note: The upper section of the diagram shows sentence surprisal, and the lower section shows sentence relevance. The x-axis represents the computational metric, and the y-axis represents reading speed. All metrics are log-transformed to bring their distributions closer to normal and thereby improve model fit. Each curve shows the relationship between a predictor variable and the response variable, reading speed. A steeper slope indicates a stronger effect of the predictor on reading speed, whereas a flatter slope indicates a weaker effect. When a curve fluctuates around zero, the predictor has no discernible effect. The p-values and $\Delta$AIC are displayed at the top of each plot. The methodology for calculating $\Delta$AIC for "sentence surprisal" and "sentence relevance" is detailed in the main text. In summary, sentence surprisal appears not to be significant in English, Korean, and Russian, whereas sentence relevance appears not to be significant in English, Russian, Spanish, and Turkish.
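The $\Delta$AIC values reported in the figures compare a model that includes a predictor against a baseline without it. Under the common convention $\mathrm{AIC} = 2k - 2\ln\hat{L}$ and $\Delta\mathrm{AIC} = \mathrm{AIC}_{\text{with predictor}} - \mathrm{AIC}_{\text{baseline}}$, a negative value favors the model with the predictor. This matches the TL;DR's reading of a substantially negative $\Delta$AIC. The exact baseline and procedure used in the paper are described in its main text and are not reproduced here. The helper names and numbers below are hypothetical. For a GAMM, $k$ would typically be the effective degrees of freedom reported by the fitting software rather than a simple parameter count, which is why `k` is a float.

```python
def aic(log_likelihood: float, k: float) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2.0 * k - 2.0 * log_likelihood


def delta_aic(ll_with: float, k_with: float, ll_base: float, k_base: float) -> float:
    """AIC(model with predictor) - AIC(baseline); negative favors the predictor."""
    return aic(ll_with, k_with) - aic(ll_base, k_base)


# Hypothetical fits: adding a predictor raises ln L by 20 at a cost of 2 extra df.
print(delta_aic(ll_with=-980.0, k_with=7.0, ll_base=-1000.0, k_base=5.0))  # -36.0
```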