
Linear Recency Bias During Training Improves Transformers' Fit to Reading Times

Christian Clark, Byung-Doh Oh, William Schuler

TL;DR

A modification of the Transformer model that uses ALiBi, a recency bias added to attention scores, fits human reading times better than a standard Transformer baseline. An analysis of attention heads suggests that ALiBi's mixture of slopes may contribute to the improvement.

Abstract

Recent psycholinguistic research has compared human reading times to surprisal estimates from language models to study the factors shaping human sentence processing difficulty. Previous studies have shown a strong fit between surprisal values from Transformers and reading times. However, standard Transformers work with a lossless representation of the entire previous linguistic context, unlike models of human language processing that include memory decay. To bridge this gap, this paper evaluates a modification of the Transformer model that uses ALiBi (Press et al., 2022), a recency bias added to attention scores. Surprisal estimates with ALiBi show an improved fit to human reading times compared to a standard Transformer baseline. A subsequent analysis of attention heads suggests that ALiBi's mixture of slopes -- which determine the rate of memory decay in each attention head -- may play a role in the improvement by helping models with ALiBi to track different kinds of linguistic dependencies.
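
As a rough illustration of the mechanism the abstract describes, the sketch below adds a linear recency bias to scaled dot-product attention scores, with one slope per attention head. It assumes the geometric slope schedule of Press et al. (2022) for a power-of-two number of heads; whether the evaluated models use exactly this schedule is an assumption here. The function names (`alibi_slopes`, `alibi_bias`, `attention_weights`) are hypothetical and are not taken from the paper's code.

```python
import math


def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumed schedule; n a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, slope):
    """Attention weights for one head: softmax(q_i . k_j / sqrt(d) + bias[i][j])."""
    n, d = len(queries), len(queries[0])
    bias = alibi_bias(n, slope)
    weights = []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d) + bias[i][j]
            for j in range(n)
        ]
        weights.append(softmax(scores))
    return weights


# Toy input: identical (all-zero) queries and keys, so only the bias shapes attention.
toy = [[0.0, 0.0] for _ in range(3)]
slopes = alibi_slopes(4)
steep = attention_weights(toy, toy, slopes[0])     # fastest decay
shallow = attention_weights(toy, toy, slopes[-1])  # slowest decay
```

With content held constant, the head with the steepest slope puts more weight on the most recent token than the head with the shallowest slope. This is the sense in which the slope sets the rate of memory decay in each head.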

Paper Structure

This paper contains 25 sections, 5 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Illustration of two recency bias techniques tested in this work and defined in Equations \ref{eq:devarda} and \ref{eq:alibi}. Bias matrices (left) are added to raw attention scores (right). Darker colors indicate higher scores. Hyperparameters $\alpha$, $\lambda$, and $m$ control the strength of the bias, and $\sqrt{d}$ scales the $q_i k_j$ values.
  • Figure 2: Aggregated likelihood results from Experiment 1. Improvements in log likelihood ($\Delta$LogLik) are summed across the corpora in Table \ref{tab:observations}. Per-corpus results are in Appendix \ref{sec:per_corpus}.
  • Figure 3: Aggregated likelihood results from Experiment 2. Models with a recency bias at inference time only (dVM, ALiBi) are compared against parallel models with a bias included during both training and inference (dVM+train, ALiBi+train). For comparison, the gray dashed line shows the cumulative $\Delta$LogLik from the baseline LM with no recency bias. Per-corpus results are in Appendix \ref{sec:per_corpus}.
  • Figure 4: Aggregated likelihood results from Experiment 3. A variant of ALiBi in which all attention heads have the same slope was tested. One set of models included this bias at inference time only (green line), and the other set included the bias during both training and inference (red line). The gray dashed line shows the cumulative $\Delta$LogLik from a baseline LM with no recency bias, and the blue dashed line shows the same measure from an LM including ALiBi with mixed slopes during training and inference. Per-corpus results are in Appendix \ref{sec:per_corpus}.
  • Figure 5: Results from Experiment 4. Mean attention scores for three types of dependencies are presented for each attention head in a model with mixed ALiBi slopes. The evaluated model (ALiBi-mix-TI) includes two Transformer layers with four attention heads per layer.
  • ...and 1 more figure