A Residual-Aware Theory of Position Bias in Transformers

Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, Sören Laue

TL;DR

The paper proves that, at finite depth, causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.

Abstract

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.

Paper Structure (57 sections, 10 theorems, 63 equations, 5 figures, 6 tables)

Key Result

Proposition 4.2

Assume causal or causal sliding-window masking and Assumption \ref{ass:stoch-monotone-A}. Let $p^{(t)}$ for $t\in\{1,\dots,T-1\}$ denote the last-row rollout distribution $p^{(t)}(j)=P^{(t)}_{nj}$. Then for every prefix length $k<n$, the prefix mass $\sum_{j\le k} p^{(t)}(j)$ is monotonically increasing in $t$. Consequently, at finite depth the last-row rollout distribution exhibits a systematic drift of mass toward earlier positions.
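
The drift described in Proposition 4.2 can be illustrated numerically. The sketch below is a toy setup, not the paper's experimental code: it assumes uniform causal attention (token $i$ attends equally to tokens $1,\dots,i$) and a per-layer kernel of the form $(1-\lambda_t)I + \lambda_t A$, which is our reading of the residual mixing coefficient described in the figure captions. The function names are hypothetical.

```python
import numpy as np

def uniform_causal_attention(n):
    """Row-stochastic causal attention: token i attends uniformly to tokens 1..i.
    (Toy assumption; real attention matrices are content-dependent.)"""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)

def last_row_rollout(n, lambdas):
    """Return the last-row rollout distributions p^(t), t = 1..T, as a (T, n) array.

    Each layer kernel is assumed to be (1 - lam) * I + lam * A, so lam = 1
    corresponds to attention-only rollout and lam = 0 to a pure residual path.
    """
    A = uniform_causal_attention(n)
    P = np.eye(n)
    rows = []
    for lam in lambdas:
        K = (1.0 - lam) * np.eye(n) + lam * A  # assumed residual-aware kernel
        P = K @ P                              # cumulative rollout through this layer
        rows.append(P[-1].copy())              # last row: influence on the final token
    return np.array(rows)

def prefix_mass(p, k):
    """Mass that the distribution(s) p place on the first k positions."""
    return p[..., :k].sum(axis=-1)

if __name__ == "__main__":
    n, T = 16, 12
    p = last_row_rollout(n, [0.3] * T)
    masses = prefix_mass(p, 4)
    print(np.round(masses, 3))                   # prefix mass per layer
    print(np.all(np.diff(masses) >= -1e-12))     # checks the drift toward earlier positions
```

In this toy setting, setting every $\lambda_t$ to 1 reproduces the attention-only case, where the mass of the last row moves onto the first token as depth grows; smaller $\lambda_t$ slows that drift because each layer keeps more of the identity path.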

Figures (5)

  • Figure 1: Depth-wise effective residual mixing coefficient $\lambda_t$ in pre-trained LLMs. We report the mean $\pm$ 95% confidence interval over 1,000 samples of length 2,048 from the FineWeb-Edu dataset. Most models exhibit decreasing attention contribution with depth.
  • Figure 2: Final-token influence distributions $p^{(T)}$ for the 70-layer bloom-176b (top) and the 32-layer mpt-7b (bottom). Panels (a)–(c) show controlled rollout variants: (a) attention-only rollout ($\lambda_t=1$, no content) exhibiting collapse to the first token; (b) residual-aware architectural rollout using measured schedules $\{\lambda_t\}$ (no content), producing a broad U-shaped profile; and (c) residual-aware rollout with empirically measured constant-plus-diagonal content, which modulates the U-shape by shifting mass toward later positions. Panel (d) shows the measured input token influence $\hat{p}^{(T)}$, which exhibits a similar U-shaped profile.
  • Figure 3: Final-token influence distribution $p^{(T)}$ for the 30-layer bloom-7b (top row), the 48-layer mpt-30b (middle row), and the 36-layer falcon-rw-7b (bottom row), computed using our residual-aware rollout theory (\ref{sec:setup}, \ref{sec:finite-depth}) and compared to the empirically estimated final-token input token influence $\hat{p}^{(T)}$ (\ref{subsec:input-token-influence}). See \ref{sec:experiments} for details.
  • Figure 4: Depth-wise effective residual mixing coefficient $\lambda_t$ (defined in \ref{eq:lambda_t}) for different datasets. $\lambda_t$ quantifies the fraction of each layer's attention update relative to the sum of residual stream and attention contributions.
  • Figure 5: Mean pre-softmax content-score heatmaps for ALiBi-based models. Rows correspond to falcon-rw-7b, bloom-7b, bloom-176b, and mpt-7b (top to bottom). Each panel shows mean pre-softmax content scores averaged over 1,000 FineWeb-Edu prompts for the indicated layer $t$, head $h$, and sequence length $n$ (reported in the subcaptions).

Theorems & Definitions (22)

  • Proposition 4.2: Primacy drift under causal and sliding-window masking
  • Proposition 4.3: Residual strength induces recency drift
  • Proposition 4.4: Positional encodings induce recency drift
  • Proposition 4.5: Diagonal content induces a signed recency drift
  • Lemma 5.3: Uniform Stability and Attention Lower Bound
  • Theorem 5.4: Residual-aware infinite-depth collapse dichotomy
  • Definition 1.1: Stochastically monotone (isotone) kernel
  • Lemma 1.2: Order preservation under a stochastically monotone kernel
  • proof
  • Lemma 1.3: Residual kernels inherit stochastic monotonicity
  • ...and 12 more