Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly
Yi-Chien Lin, William Schuler
TL;DR
The paper asks whether the inverse scaling between Transformer LM size and the predictive power of surprisal, previously observed for reading times, also holds for fMRI data. The authors compute surprisal from 17 Transformer LMs in the GPT-2, GPT-Neo, and OPT families, convolve the word-level estimates with a canonical hemodynamic response function (HRF), and evaluate the resulting predictors in linear mixed-effects models of brain responses from the Natural Stories and Pereira fMRI datasets. The key finding is a robust inverse scaling in both datasets: larger models do not necessarily align better with brain data, and the phenomenon generalizes beyond latency measures. For cognitive neuroscience, this suggests that smaller, more human-aligned LMs may yield more interpretable insights into sentence processing.
Abstract
There has been considerable interest in using surprisal from Transformer-based language models (LMs) as predictors of human sentence processing difficulty. Recent work has observed an inverse scaling relationship between Transformers' per-word estimated probability and the predictive power of their surprisal estimates on reading times, showing that LMs with more parameters and trained on more data are less predictive of human reading times. However, these studies focused on predicting latency-based measures. Tests on brain imaging data have not shown a trend in any direction when using a relatively small set of LMs, leaving open the possibility that the inverse scaling phenomenon is constrained to latency data. This study therefore conducted a more comprehensive evaluation using surprisal estimates from 17 pre-trained LMs across three different LM families on two functional magnetic resonance imaging (fMRI) datasets. Results show that the inverse scaling relationship between models' per-word estimated probability and model fit on both datasets still obtains, resolving the inconclusive results of previous work and indicating that this trend is not specific to latency-based measures.
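As a rough illustration of the predictor construction summarized in the TL;DR (word-level surprisal convolved with a canonical HRF), the following minimal sketch may help. It is not the authors' pipeline: the double-gamma shape parameters (6 and 16, with a 1/6 undershoot ratio) are a commonly used convention assumed here, and the function names, word probabilities, onsets, and scan times are hypothetical toy values.

```python
import math

def surprisal(prob):
    """Surprisal in bits of a word whose conditional probability is prob."""
    return -math.log2(prob)

def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero at and before t = 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)

def hrf(t, peak_shape=6.0, under_shape=16.0, under_ratio=1.0 / 6.0):
    """Double-gamma HRF (assumed parameters, not taken from the paper)."""
    return gamma_pdf(t, peak_shape) - under_ratio * gamma_pdf(t, under_shape)

def convolve_at_scans(onsets, values, scan_times):
    """Sum each word's HRF-weighted value at every scan time (in seconds)."""
    return [
        sum(v * hrf(s - o) for o, v in zip(onsets, values))
        for s in scan_times
    ]

# Hypothetical toy input: three words with LM probabilities and onsets.
probs = [0.5, 0.25, 0.125]
onsets = [0.0, 0.4, 0.9]            # word onset times in seconds
scan_times = [0.0, 2.0, 4.0, 6.0]   # fMRI sample times in seconds

surps = [surprisal(p) for p in probs]
predictor = convolve_at_scans(onsets, surps, scan_times)
```

The resulting `predictor` has one value per scan and could then enter a regression model as a fixed effect alongside baseline predictors; the mixed-effects fitting itself is outside the scope of this sketch.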
