Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics

James A. Michaelov, Catherine Arnett, Benjamin K. Bergen

TL;DR

This study challenges the view that transformers are uniquely suited to modeling online human language comprehension, showing that the contemporary recurrent architectures RWKV and Mamba can match or surpass transformers of comparable scale at predicting N400 amplitude and reading-time metrics. With all models trained on the same data (the Pile) and matched for size, the authors compare Pythia (transformer) against RWKV and Mamba across 12 datasets and 5 metrics, using surprisal-based predictions of human processing. On the N400 datasets, recurrent models typically provide better fits, with Mamba often fitting best; reading-time results are more dataset-specific, favoring transformers or recurrent models depending on the metric and model scale. The findings imply that transformer superiority in predicting human comprehension is not universal. They invite further study of how architecture contributes to cognitive plausibility and suggest that recurrent models offer a valuable, complementary perspective on neural and behavioral language processing.
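To make "surprisal-based predictions" concrete: the surprisal of a word is the negative log-probability a language model assigns to it given the preceding context. The sketch below is illustrative only. The function name, the choice of bits as the unit, and the summing of sub-word token surprisals into one value per word are assumptions, not details confirmed by this summary, and the probabilities are made up.

```python
import math
from collections import defaultdict


def word_surprisal_bits(token_logprobs, word_ids):
    """Return one surprisal value (in bits) per word.

    token_logprobs: natural-log probability of each sub-word token
                    given its preceding context (hypothetical input).
    word_ids:       index of the word each token belongs to.
    """
    totals = defaultdict(float)
    for logp, wid in zip(token_logprobs, word_ids):
        # Surprisal = -log2 p(token | context); convert from nats to bits.
        totals[wid] += -logp / math.log(2)
    # Return surprisals in word order.
    return [totals[w] for w in sorted(totals)]


# Made-up probabilities for a two-word example; the second word
# is split into two sub-word tokens.
logps = [math.log(0.5), math.log(0.25), math.log(0.5)]
ids = [0, 1, 1]
surprisals = word_surprisal_bits(logps, ids)
```

In a study of this kind, per-word values like these would then serve as predictors of a human measure such as N400 amplitude or reading time; how that regression is set up is not described in this summary.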

Abstract

Transformers have generally supplanted recurrent neural networks as the dominant architecture both for natural language processing tasks and for modeling the effect of predictability on online human language comprehension. However, two recently developed recurrent model architectures, RWKV and Mamba, appear to perform natural language tasks comparably to or better than transformers of equivalent scale. In this paper, we show that contemporary recurrent models are now also able to match, and in some cases exceed, the performance of comparably sized transformers at modeling online human language comprehension. This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.

Paper Structure

This paper contains 39 sections, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Language model performance at predicting N400 amplitude.
  • Figure 2: Language model performance at predicting 4 reading time metrics (see the datasets subsection of the method section).
  • Figure 3: Comparison of the WikiText perplexity of each model of each architecture. Word-level perplexity of the WikiText-2 test set (Merity et al., 2017) was calculated using the Language Model Evaluation Harness (Gao et al., 2021).
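
The word-level perplexity named in the Figure 3 caption can be illustrated with a small sketch. It uses the standard definition: the exponential of the negative total log-likelihood divided by the number of whitespace-delimited words, rather than sub-word tokens. This is not the harness's own code; the function name and input values are hypothetical.

```python
import math


def word_level_perplexity(doc_logliks, doc_word_counts):
    """Word-level perplexity over a set of documents.

    doc_logliks:     total natural-log likelihood of each document
                     under the model (hypothetical inputs).
    doc_word_counts: number of whitespace-delimited words per document.
    """
    total_loglik = sum(doc_logliks)
    total_words = sum(doc_word_counts)
    # Normalize by words, not tokens, so that models with different
    # tokenizers can be compared on the same footing.
    return math.exp(-total_loglik / total_words)


# Made-up example: two documents in which every word has probability 0.1.
logliks = [2 * math.log(0.1), 3 * math.log(0.1)]
word_counts = [2, 3]
ppl = word_level_perplexity(logliks, word_counts)
```

Normalizing by word count is what lets a single perplexity figure be compared across architectures, provided the same text and word segmentation are used for every model.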