Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics
James A. Michaelov, Catherine Arnett, Benjamin K. Bergen
TL;DR
This study challenges the view that transformers are uniquely suited to modeling online human language comprehension, showing that the contemporary recurrent architectures RWKV and Mamba can match or surpass transformers of comparable scale on N400 and reading-time metrics. Using models trained on the same dataset (the Pile) and matched for size, the authors evaluate Pythia (transformer) against RWKV and Mamba across 12 datasets and 5 metrics, focusing on surprisal-based predictions of human processing. On the N400 datasets, recurrent models typically provide better fits, with Mamba often showing the strongest alignment. Reading-time results are more dataset-specific: depending on the dataset, metric, and model scale, they favor either transformers or recurrent models. The findings imply that transformer superiority in predicting human comprehension is not universal. They invite further work on how architecture contributes to cognitive plausibility and suggest that recurrent models offer a valuable, complementary perspective on neural and behavioral language processing.
Abstract
Transformers have generally supplanted recurrent neural networks as the dominant architecture both for natural language processing tasks and for modeling the effect of predictability on online human language comprehension. However, two recently developed recurrent model architectures, RWKV and Mamba, appear to perform natural language tasks comparably to or better than transformers of equivalent scale. In this paper, we show that contemporary recurrent models are now also able to match (and in some cases exceed) the performance of comparably sized transformers at modeling online human language comprehension. This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.
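
For readers unfamiliar with the predictability measure used in work of this kind: the surprisal of a word is the negative log-probability a language model assigns to it given the preceding context, so less expected words receive higher values. The following is a minimal sketch of that definition only. The function name and the toy probabilities are hypothetical and invented for illustration; they do not come from the models or datasets evaluated in the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Return surprisal in bits, i.e. -log2 p(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prob)


# Hypothetical next-word probabilities for a single context.
# These numbers are made up for illustration; a real study would read
# them from a language model's output distribution.
toy_next_word_probs = {
    "expected_word": 0.5,
    "plausible_word": 0.25,
    "anomalous_word": 0.001,
}

# Map each candidate word to its surprisal under the toy distribution.
surprisals = {
    word: surprisal_bits(p) for word, p in toy_next_word_probs.items()
}

if __name__ == "__main__":
    for word, value in surprisals.items():
        print(f"{word}: {value:.2f} bits")
```

In a comparison like the one described here, per-word surprisal values from each language model would then serve as predictors of the human measures (N400 amplitude or reading time); that statistical step is not shown in this sketch.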
