
Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias

Borun D Chowdhury

TL;DR

Standard training does not overcome the topological valley: the U-shape persists as an architectural baseline under standard pretraining objectives. The paper establishes what that baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.

Abstract

The "Lost in the Middle" phenomenon (a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle) is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: the U-shape is already present at initialization, before any training or positional encoding takes effect. It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Cesàro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated $\mathcal{O}(1)$ anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order $\mathcal{O}(1/(H-1)!)$, where $H$ is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step 0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.
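
The Cesàro-matrix picture in the abstract can be illustrated with a small numerical sketch. It is not the paper's code. It assumes that attention at initialization is uniform over the causal prefix, so each layer acts as the Cesàro matrix $C$ with $C_{ij} = 1/(i+1)$ for $j \le i$, and that a residual layer acts as $I + C$; the paper's exact normalization is not given in this excerpt. The function names, the sequence length 64, and the depth 4 are hypothetical choices made for illustration.

```python
def cesaro_step(v):
    """Return the row vector v @ C, where C[i][j] = 1/(i+1) for j <= i.

    C models uniform causal attention (an assumption of this sketch).
    (v @ C)[j] = sum over i >= j of v[i] / (i + 1), computed as a suffix sum.
    """
    out = [0.0] * len(v)
    running = 0.0
    for i in range(len(v) - 1, -1, -1):
        running += v[i] / (i + 1)
        out[i] = running
    return out


def last_token_influence(n, depth, residual=False):
    """Weight of each input position on the final token after `depth` layers.

    This is the last row of C**depth, or of (I + C)**depth when residual=True
    (the I + C form is an assumption, not taken from the paper).
    """
    v = [0.0] * n
    v[-1] = 1.0  # start from the final token
    for _ in range(depth):
        attended = cesaro_step(v)
        if residual:
            v = [a + b for a, b in zip(v, attended)]  # identity path + attention
        else:
            v = attended
    return v


causal = last_token_influence(64, 4)
hybrid = last_token_influence(64, 4, residual=True)
for name, w in (("causal", causal), ("hybrid", hybrid)):
    print(name, "first:", w[0], "middle:", w[32], "last:", w[-1])
```

In this toy setting the purely causal weights decrease monotonically from the first token, which is the discrete counterpart of the Primacy Tail. Adding the identity path leaves that decreasing profile in place and adds an isolated jump at the final token, so the first and last positions both outweigh the middle.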

Paper Structure (43 sections, 70 equations, 7 figures)


Figures (7)

  • Figure 1: Empirical Validation of Position Bias on Qwen2-0.5B at Initialization (Step 0). Top Row (Theory): The exact topological solutions for pure Causal Primacy (Eq. \ref{eq:causal}), Hybrid Residual Recency (Eq. \ref{eq:recency}) (plotted on a logarithmic scale), and the theoretical irrelevance of positional encodings at initialization. Bottom Row (Empirical): The measured Input-Output Jacobian norm $\rho(x)$ for the 24-layer Qwen2 architecture prior to training. The empirical network tightly recovers the mathematically derived asymmetric U-shape. Crucially, Exp 3 demonstrates that Rotary Positional Embeddings (RoPE) have no structural effect on the topological U-shape at initialization, confirming our hypothesis that middle-context degradation is an inherent property of causal residuals.
  • Figure 2: Jacobian Norm: Initialization vs. Pretrained Qwen2-0.5B ($H=24$, $L=2048$), Vanilla vs. Chunked Context. Averaged over 200 NQ sequences with p16--p84 percentile bands. Top row (Vanilla): Each sequence is a single NQ document truncated to 2048 tokens. Bottom row (Chunked): Each sequence concatenates 300-token excerpts from distinct NQ documents with no separator tokens, boundaries aligned at positions 0, 300, 600, … Left column (Initialization): Both conditions exhibit the smooth Cesàro U-shape with no sensitivity to content or chunk boundaries. Right column (Pretrained): The macroscopic U-shape persists. In the chunked condition, sharp spikes emerge at the 300-token document boundaries, confirming that these are learned content-discontinuity detectors rather than positional artifacts.
  • Figure 3: Evolution of the Jacobian Topology During Early Pretraining (Steps 0--100). Measured on Qwen2-0.5B with a sequence length of $L=2048$. Left (Raw Magnitude): The global gradient norm decreases as the model stabilizes, but the macroscopic U-shaped topology persists rigidly. Right (Anchored): By normalizing the Recency Anchor ($x=1$) to $1.0$, we observe that the relative depth of the "Lost-in-the-Middle" valley actually increases during training. The optimizer does not flatten the combinatorially suppressed middle section, instead relying increasingly on the geometric path of least resistance: the residual Recency Anchor and the logarithmic Primacy Tail.
  • Figure 4: Init: GPT-2 Small ($H=12$)
  • Figure 5: Init: GPT-2 Medium ($H=24$)
  • ...and 2 more figures