A Residual-Aware Theory of Position Bias in Transformers

Hanna Herasimchyk; Robin Labryga; Tomislav Prusina; Sören Laue

TL;DR

The paper proves that, at finite depth, causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.

Abstract

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. Under causal masking at infinite depth, prior theoretical analyses of attention rollout predict an inevitable collapse of attention onto the first token. Such collapse, however, does not occur in practice. We resolve this discrepancy with a residual-aware theory of cumulative attention rollout. By incorporating residual connections, we show that this architectural component prevents collapse under realistic conditions. At finite depth, we prove that causal Transformers induce a U-shaped position bias, with attention concentrating on early and late tokens. This result provides a principled architectural explanation for the Lost-in-the-Middle phenomenon.
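The contrast between attention-only collapse and residual-aware rollout can be illustrated on a toy example. The sketch below is not the authors' code. It assumes uniform causal attention (no content, no positional encoding), a per-layer kernel of the form $(1-\lambda_t) I + \lambda_t A$, and a constant, hand-picked $\lambda_t$. The helper names `uniform_causal_attention` and `last_row_rollout` are hypothetical.

```python
import numpy as np

def uniform_causal_attention(n):
    """Row-stochastic causal attention with equal weight on every visible token
    (a toy stand-in: no content scores, no positional encoding)."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)

def last_row_rollout(A, lambdas):
    """Last row of the cumulative rollout, assuming each layer mixes as
    (1 - lam) * I + lam * A. This kernel form is an assumption of the sketch."""
    n = A.shape[0]
    P = np.eye(n)
    for lam in lambdas:
        P = ((1.0 - lam) * np.eye(n) + lam * A) @ P
    return P[-1]

n, depth = 32, 24
A = uniform_causal_attention(n)

# Attention-only rollout: lam = 1 in every layer (no residual path).
p_attention_only = last_row_rollout(A, [1.0] * depth)

# Residual-aware rollout: a small, constant attention fraction (illustrative value).
p_residual = last_row_rollout(A, [0.05] * depth)

print("first-token mass:", p_attention_only[0], p_residual[0])
print("last-token mass: ", p_attention_only[-1], p_residual[-1])
```

In this toy setting the attention-only rollout puts almost all of its mass on the first token, while the residual path keeps a sizable share on the last token. The values of `n`, `depth`, and $\lambda_t$ are arbitrary and are not taken from the paper's measurements.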

Paper Structure

This paper contains 57 sections, 10 theorems, 63 equations, 5 figures, 6 tables.

Introduction
Contributions.
Related Work
Attention rollout and theoretical position bias.
Empirical primacy and recency bias.
Positional encodings and bias modulation.
Lost-in-the-Middle phenomena.
Mitigating position bias.
Notation and Problem Setup
Tokens and layers.
Masks.
Attention matrices.
Attention logits and content model.
Residual connections and mixing.
Cumulative rollout.
...and 42 more sections

Key Result

Proposition 4.2

Assume causal or causal sliding-window masking and the stochastic monotonicity assumption on the attention matrices (ass:stoch-monotone-A). Let $p^{(t)}$ denote the last-row rollout distribution at depth $t$, $p^{(t)}(j)=P^{(t)}_{nj}$. Then for every $t\in\{1,\dots,T-1\}$ and every prefix length $k<n$, the prefix mass is monotonically increasing in depth:

$$\sum_{j=1}^{k} p^{(t+1)}(j) \;\ge\; \sum_{j=1}^{k} p^{(t)}(j).$$

Consequently, at finite depth the last-row rollout distribution exhibits a systematic drift of mass toward earlier positions.
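The primacy drift can be checked numerically on a small example. The sketch below assumes uniform causal attention, whose rows are stochastically ordered, and a residual kernel of the form $(1-\lambda_t) I + \lambda_t A$. The schedule `lambdas` is made up for illustration, and the function names are hypothetical. It records the last rollout row after each layer and tests whether every prefix mass is nondecreasing in depth.

```python
import numpy as np

def uniform_causal_attention(n):
    """Toy causal attention: row i spreads its weight evenly over tokens 1..i."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)

def residual_rollout_last_rows(A, lambdas):
    """Return the last rollout row after each layer, assuming the layer kernel
    (1 - lam) * I + lam * A. This kernel form is an assumption of the sketch."""
    n = A.shape[0]
    P = np.eye(n)
    rows = []
    for lam in lambdas:
        K = (1.0 - lam) * np.eye(n) + lam * A
        P = K @ P
        rows.append(P[-1].copy())
    return np.array(rows)

n = 16
lambdas = np.linspace(0.5, 0.1, 8)  # illustrative schedule, not measured values
rows = residual_rollout_last_rows(uniform_causal_attention(n), lambdas)

# prefix_mass[t, k-1] = sum_{j <= k} p^{(t+1)}(j), for prefix lengths k = 1..n-1.
prefix_mass = np.cumsum(rows, axis=1)[:, :-1]

# Change in prefix mass from one depth to the next; the proposition predicts >= 0.
depth_increments = np.diff(prefix_mass, axis=0)
print(bool(np.all(depth_increments >= -1e-12)))
```

The small negative tolerance only absorbs floating-point rounding. The check covers this one toy configuration and does not replace the proof.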

Figures (5)

Figure 1: Depth-wise effective residual mixing coefficient $\lambda_t$ in pre-trained LLMs. We report the mean $\pm$ 95% confidence interval over 1,000 samples of length 2,048 from the FineWeb-Edu dataset. Most models exhibit decreasing attention contribution with depth.
Figure 2: Final-token influence distributions $p^{(T)}$ for the 70-layer bloom-176b (top) and the 32-layer mpt-7b (bottom). Panels (a)–(c) show controlled rollout variants: (a) attention-only rollout ($\lambda_t=1$, no content) exhibiting collapse to the first token; (b) residual-aware architectural rollout using measured schedules $\{\lambda_t\}$ (no content), producing a broad U-shaped profile; and (c) residual-aware rollout with empirically measured constant-plus-diagonal content, which modulates the U-shape by shifting mass toward later positions. Panel (d) shows the measured input token influence $\hat{p}^{(T)}$, which exhibits a similar U-shaped profile.
Figure 3: Final-token influence distribution $p^{(T)}$ for the 30-layer bloom-7b (top row), the 48-layer mpt-30b (middle row), and the 36-layer falcon-rw-7b (bottom row), computed using our residual-aware rollout theory (see the setup and finite-depth sections) and compared to the empirically estimated final-token input token influence $\hat{p}^{(T)}$ (see the input-token-influence subsection). See the experiments section for details.
Figure 4: Depth-wise effective residual mixing coefficient $\lambda_t$ (defined in the equation labeled eq:lambda_t) for different datasets. $\lambda_t$ quantifies the fraction of each layer's attention update relative to the sum of residual stream and attention contributions.
Figure 5: Mean pre-softmax content-score heatmaps for ALiBi-based models. Rows correspond to falcon-rw-7b, bloom-7b, bloom-176b, and mpt-7b (top to bottom). Each panel shows mean pre-softmax content scores averaged over 1,000 FineWeb-Edu prompts for the indicated layer $t$, head $h$, and sequence length $n$ (reported in the subcaptions).
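The Figure 4 caption describes $\lambda_t$ as the attention update's share of the combined residual-stream and attention contributions. The exact definition (eq:lambda_t) is not reproduced here, so the sketch below uses a stand-in: per-token Euclidean norms, averaged over tokens. The choice of norm, the averaging, and the function name `attention_fraction` are all assumptions.

```python
import numpy as np

def attention_fraction(residual_stream, attention_update):
    """Hypothetical per-layer estimate of lambda_t: the mean over tokens of
    ||a|| / (||x|| + ||a||), where x is the residual stream entering the layer
    and a is the attention update. The paper's definition may differ."""
    x_norm = np.linalg.norm(residual_stream, axis=-1)
    a_norm = np.linalg.norm(attention_update, axis=-1)
    return float(np.mean(a_norm / (x_norm + a_norm)))

# Synthetic activations of shape (tokens, hidden size), for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))

lam_equal = attention_fraction(x, x)        # update as large as the stream
lam_small = attention_fraction(x, 0.1 * x)  # update ten times smaller
print(lam_equal, lam_small)
```

Under this stand-in, an update with the same norm as the residual stream gives a fraction of one half, and smaller updates give proportionally smaller values.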

Theorems & Definitions (22)

Proposition 4.2: Primacy drift under causal and sliding-window masking
Proposition 4.3: Residual strength induces recency drift
Proposition 4.4: Positional encodings induce recency drift
Proposition 4.5: Diagonal content induces a signed recency drift
Lemma 5.3: Uniform Stability and Attention Lower Bound
Theorem 5.4: Residual-aware infinite-depth collapse dichotomy
Definition 1.1: Stochastically monotone (isotone) kernel
Lemma 1.2: Order preservation under a stochastically monotone kernel
proof
Lemma 1.3: Residual kernels inherit stochastic monotonicity
...and 12 more
