Temperature-scaling surprisal estimates improve fit to human reading times -- but does it do so for the "right reasons"?

Tong Liu; Iza Škrjanec; Vera Demberg

TL;DR

This work investigates calibrating large language model predictions to better align surprisal with human reading times. By applying temperature scaling to the LLM outputs, the authors demonstrate substantial improvements in predicting human RTs across three corpora and four GPT-2 variants, with an optimal temperature around $T^* \approx 2.5$. The key finding is that the gains are largely driven by words that are tokenized into multiple subword units, implicating subword-tokenization and rare word processing as central factors. However, the approach worsens calibration metrics such as ECE and CECE, highlighting a trade-off between RT fit and pure probability calibration. The results connect to contextual Rényi entropy, suggesting a broader relationship between softened probability distributions and human-like anticipatory processing in reading.
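
For readers unfamiliar with ECE, the following is a minimal sketch of the common equal-width binned formulation (Guo et al., 2017). It assumes confidence is the maximum predicted probability and uses ten bins; the paper's exact binning scheme and its CECE variant may differ, and the function name and toy data are illustrative only.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # confidence = top predicted probability
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap     # weight by the fraction of samples in the bin
    return ece

# Toy data (hypothetical): ten predictions at 0.9 confidence, only five correct.
toy_probs = [[0.9, 0.1]] * 10
toy_labels = [0] * 5 + [1] * 5
print(expected_calibration_error(toy_probs, toy_labels))
```

A model that is right half the time while claiming 90% confidence is overconfident, and the score reflects the gap between the two.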

Abstract

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word's negative log probability in context. However, it is still unclear how best to estimate the probabilities needed for predicting human processing difficulty. While a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident in their predictions than humans, because they have been exposed to several orders of magnitude more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power for reading times of English texts. First, we show that calibration of large language models typically improves with model size, i.e., poorer calibration cannot account for poorer fit to reading times. Second, we find that temperature-scaling probabilities leads to a systematically better fit to reading times (up to 89% improvement in delta log likelihood) across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.
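
The core operation described in the abstract can be sketched as follows. This is a minimal illustration, assuming surprisal is measured in bits and that temperature scaling divides the model's logits by $T$ before the softmax; the logits and the function name are hypothetical and do not come from the paper.

```python
import numpy as np

def temperature_scaled_surprisal(logits, target_index, temperature=1.0):
    """Surprisal (in bits) of the target token under softmax(logits / T)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                              # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum())   # natural-log softmax
    return -log_probs[target_index] / np.log(2.0)       # convert nats to bits

# Hypothetical logits over a 5-token vocabulary; token 0 is the observed word.
logits = [4.0, 1.0, 0.5, 0.0, -1.0]
for T in (1.0, 2.5, 10.0):
    print(f"T={T}: surprisal={temperature_scaled_surprisal(logits, 0, T):.3f} bits")
```

When the observed token is the model's top prediction, raising $T$ flattens the distribution and increases its surprisal, which is the sense in which temperature scaling counteracts overconfidence.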

Paper Structure (36 sections, 4 theorems, 23 equations, 11 figures, 13 tables)

Introduction
Predictive Power for Reading Times
Methods
Surprisal
Calibration error
Definitions
Expected calibration error (ECE) (Guo et al., 2017)
Classwise-ECE (CECE) (Kumar et al., 2019; Kull et al., 2019)
Human-likeness calibration error (HCE)
Temperature-scaled surprisal
Experimental setup
Datasets
Language Models
Metrics and evaluation
Results
...and 21 more sections

Key Result

Theorem 1

(Monotonicity of $s_{T}(w_{t}, T)$ and $\mathrm{H}_{\alpha}(w_{t} \mid \boldsymbol{w}_{<t})$). Given any probability distribution $\boldsymbol{p}$ with actual-word probability $p_{w_{t}} > 1/K$, where $K$ is the number of classes, temperature-scaled surprisal $s_{T}(w_{t}, T)$ is strictly monotonically [...] where $T^{*}$ is the optimal $T$ for fit to RTs in the range of $\Delta_{T}$.
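
For reference, the two quantities named in the theorem title are conventionally defined as below. These are the standard textbook forms, assumed here rather than quoted from the paper; the logit notation $z_{k}$ and the choice of base-2 logarithms are assumptions, and $K$ is the number of classes as in the theorem.

$$
p_{T}(w_{t} \mid \boldsymbol{w}_{<t}) = \frac{\exp(z_{w_{t}} / T)}{\sum_{k=1}^{K} \exp(z_{k} / T)},
\qquad
s_{T}(w_{t}, T) = -\log_{2} p_{T}(w_{t} \mid \boldsymbol{w}_{<t})
$$

$$
\mathrm{H}_{\alpha}(w_{t} \mid \boldsymbol{w}_{<t}) = \frac{1}{1 - \alpha} \log_{2} \sum_{k=1}^{K} p_{k}^{\alpha},
\qquad \alpha \geq 0,\ \alpha \neq 1
$$

Two limits are useful when reading the results: as $T \to \infty$ the scaled distribution becomes uniform, so $s_{T} \to \log_{2} K$ for every word, and as $\alpha \to 1$ the Rényi entropy reduces to Shannon entropy $\mathrm{H}_{1} = -\sum_{k} p_{k} \log_{2} p_{k}$.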

Figures (11)

Figure 1: Temperature-scaled surprisal $s_{T}(w_{t}, T)$ with corresponding $T \in [1, 2.5]$ for two random five-class probability distributions: $p_{i} = [0.8, 0.05, 0.05, 0.05, 0.05]$ and $p_{j} = [0.8, 0.2, 0, 0, 0]$. Dashed lines show Shannon entropy ($\mathrm{H}_{1}$). Loosely dashed lines show Rényi entropy with $\alpha = 1/2$ ($\mathrm{H}_{1/2}$).
Figure 2: Relationship between $\Delta_{\mathrm{llh}}$ of GPT-2 models and corresponding temperature. T is scaled from 1.0 to 10.
Figure 3: Relationship between $\Delta_{\mathrm{MSE}}$ and negative log actual-word probability (surprisal). The number of bins is set to 20. Black dashed lines denote $\Delta_{\mathrm{MSE}} = 0$. Subsets containing less than 1% of the data are ignored for each corpus.
Figure 4: Relationship between $\Delta_{\mathrm{llh}}$ of GPT-2 small on three corpora and corresponding temperature $T$.
Figure 5: A comparison of averaged temperature-scaled surprisal $\overline{s}_{T}|_{T = \{ 1, T^{*}, \infty\}}$ and Rényi entropy $\overline{\mathrm{H}}_{\alpha}|_{\alpha = \{0, 1/2, 1 \}}$.
...and 6 more figures
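
The quantities plotted in Figure 1 can be recomputed for the two distributions given in its caption. The sketch below assumes that the actual word is the class with probability 0.8 (the caption does not say which class is observed), that entropies are in bits, and that temperature scaling is applied directly to probabilities as $p_{k}^{1/T}$ followed by renormalisation, which is equivalent to dividing log-probabilities by $T$. Function names are illustrative.

```python
import numpy as np

def scale_distribution(p, temperature):
    """Temperature-scale a probability vector: p_k^(1/T), renormalised."""
    powered = np.asarray(p, dtype=float) ** (1.0 / temperature)
    return powered / powered.sum()

def renyi_entropy(p, alpha):
    """Rényi entropy in bits; alpha = 1 is handled as the Shannon limit."""
    p = np.asarray(p, dtype=float)
    support = p[p > 0]                      # zero-probability classes contribute nothing
    if np.isclose(alpha, 1.0):
        return float(-(support * np.log2(support)).sum())
    return float(np.log2((support ** alpha).sum()) / (1.0 - alpha))

p_i = [0.8, 0.05, 0.05, 0.05, 0.05]
p_j = [0.8, 0.2, 0.0, 0.0, 0.0]

for name, p in (("p_i", p_i), ("p_j", p_j)):
    for T in (1.0, 1.5, 2.0, 2.5):
        # Assumption: index 0 (probability 0.8) is the actual word.
        s_T = -np.log2(scale_distribution(p, T)[0])
        print(f"{name} T={T}: s_T={s_T:.3f} bits")
    print(f"{name} H_1={renyi_entropy(p, 1.0):.3f}  H_1/2={renyi_entropy(p, 0.5):.3f}")
```

Both distributions give the actual word the same surprisal at $T = 1$, but they diverge as $T$ grows because the scaled probability depends on how the remaining mass is spread over the other classes.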

Theorems & Definitions (4)

Theorem 1
Theorem 2
Theorem 3
Lemma 4
