LayerNorm Induces Recency Bias in Transformer Decoders

Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi

TL;DR

This work investigates why Transformer decoders exhibit recency bias rather than the earlier-token bias that stacked causal self-attention alone produces. By analyzing how causal masking interacts with LayerNorm, residual connections, and the anisotropy of input embeddings, it derives conditions under which recency bias appears. Notably, LayerNorm can induce recency at the level of attention scores: for $i \ge j > k$, $S_{ij} > S_{ik}$ (i.e., $S_{ij}$ grows with $j$). The findings indicate that LayerNorm is a key driver of recency bias, that residual connections do not fully extinguish it, and that embedding anisotropy can amplify it. These insights offer guidance for designing positional encoding strategies and for improving length generalization in decoder architectures.

Abstract

Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.

Paper Structure

This paper contains 13 sections, 4 theorems, 25 equations, 5 figures.

Key Result

Theorem 1

Let the input token embeddings follow $\mathcal{N}(0, \mathbb{I}_d/d)$, and let the architecture be composed of stacked LayerNorm and causal self-attention layers. For hidden size $d \gg 1$, the attention scores of the second layer, $S^{(2)}$, exhibit a recency bias.
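
The statement of Theorem 1 can be illustrated numerically. The sketch below is not the paper's simulation or proof; it is a minimal NumPy experiment under assumptions that are not stated in the text above: LayerNorm has no learned scale or shift, all query, key, and value projections are the identity, the first layer attends uniformly over the causal prefix, and logits are scaled by $1/\sqrt{d}$. The names `layer_norm`, `causal_softmax`, and `simulate` are hypothetical helpers. Under these assumptions the first-layer output at position $i$ is an average of $i$ nearly orthogonal vectors, so a back-of-the-envelope estimate gives a second-layer cosine similarity of about $\sqrt{j/i}$ for $j \le i$, which increases with $j$.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned scale or shift (assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_softmax(logits):
    """Row-wise softmax restricted to keys j <= i (causal mask)."""
    n = logits.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, logits, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal is
    # always unmasked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

def simulate(n=16, d=4096, seed=0):
    """Return (cos, s2): second-layer normalized dot products and attention scores."""
    rng = np.random.default_rng(seed)
    # Token embeddings drawn from N(0, I_d / d), as in Theorem 1.
    x = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))

    # Layer 1 (simplifying assumption): uniform causal attention, identity values.
    a1 = causal_softmax(np.zeros((n, n)))
    h1 = a1 @ layer_norm(x)

    # Layer 2: LayerNorm, then dot-product logits with identity projections.
    z = layer_norm(h1)
    logits = (z @ z.T) / np.sqrt(d)
    cos = logits / np.sqrt(d)  # approximately the cosine similarity of rows of h1
    s2 = causal_softmax(logits)
    return cos, s2

if __name__ == "__main__":
    cos, s2 = simulate()
    np.set_printoptions(precision=2, suppress=True)
    print("normalized dot products, last query row:")
    print(cos[-1])
    print("attention scores, last query row:")
    print(s2[-1])
```

In this toy setting the last query row of `cos` should rise from roughly $\sqrt{1/16}$ at the first key toward 1 at the last key, and the corresponding attention scores concentrate on the most recent keys. Dropping the second `layer_norm` call gives a point of comparison for the no-LayerNorm case discussed in the abstract, although the exact behavior there depends on the assumptions above.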

Figures (5)

  • Figure 1: Visualization of the attention scores using a simulation. LN and Res correspond to LayerNorm and residual connections, respectively. The y-axis represents query indices, and the x-axis represents key indices.
  • Figure 2: Visualization of $h(j)$ over key index $j$, for multiple values of $\alpha$.
  • Figure 3: Extended results of Figure \ref{fig:l2norm} with multiple $\alpha$ values and no residual connections.
  • Figure 4: Extended results of Figure \ref{fig:l2norm} with multiple $\alpha$ values and with residual connections.
  • Figure 5: Visualization of $h(j)$ over key index $j$, for multiple values of $\alpha$, including residual connections.

Theorems & Definitions (8)

  • Definition 1
  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • proof
  • proof
  • proof