Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences

Siquan Li, Yao Tong, Haonan Wang, Tianyang Hu

TL;DR

The paper shows that randomly initialized Transformers are not neutral: they harbor strong, seed-specific biases that skew next-token predictions even before any training. It develops a mechanistic account in which inter-sequence contraction, driven by asymmetric MLP activations, and intra-sequence contraction, driven by self-attention, align representations along a seed-determined direction, producing pronounced top-token preferences. Crucially, these initialization-induced biases persist through training, enabling SeedPrint, a fingerprinting method capable of distinguishing models by birth seed even under distribution shifts. The authors further connect a positional variance discrepancy in attention to the attention-sink phenomenon and demonstrate practical architectural mitigations: variance calibration strategies that reduce sinks without sacrificing language modeling performance. Together, these results shift focus from what models learn to what they are born with, offering new tools for model attribution and stability control in large-scale LLMs.

Abstract

Transformers underpin modern large language models (LLMs) and are commonly assumed to be behaviorally unstructured at random initialization, with all meaningful preferences emerging only through large-scale training. We challenge this assumption by showing that randomly initialized transformers already exhibit strong and systematic structural biases. In particular, untrained models display extreme token preferences: across random input sequences, certain tokens are predicted with probabilities orders of magnitude larger. We provide a mechanistic explanation for this phenomenon by dissecting the transformer architecture at initialization. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction. This contraction is driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers induce global (inter-sequence) representation concentration, and (ii) self-attention further amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms align hidden representations along a direction determined solely by the random initialization, producing highly non-uniform next-token predictions. Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist throughout training, forming a stable and intrinsic model identity. Leveraging this property, we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that differ only in their random initialization, even after extensive training and under substantial distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the attention mechanism's intra-sequence contraction that is causally linked to the attention-sink phenomenon. This discovery provides a principled explanation for the emergence of sinks and offers a pathway for their control.
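
The token-preference claim can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the paper's experimental setup: it uses an MLP-only stack (no attention, LayerNorm, or residuals), Gaussian weights, and hypothetical names and sizes (`top_token_frequency`, `vocab=500`, `d=64`). It feeds random token embeddings through random ReLU MLP layers and reports how often the single most-predicted token wins, next to the uniform baseline `1 / vocab`.

```python
import numpy as np

def top_token_frequency(vocab=500, d=64, n_layers=6, n_samples=2000, seed=0):
    """Fraction of random inputs whose argmax prediction is the most popular token.

    Assumption: an MLP-only toy model with Gaussian weights, chosen for
    illustration; it is not the architecture used in the paper.
    """
    rng = np.random.default_rng(seed)
    embed = rng.normal(size=(vocab, d))                 # random token embeddings
    unembed = rng.normal(size=(vocab, d)) / np.sqrt(d)  # random output head

    tokens = rng.integers(0, vocab, size=n_samples)     # random input tokens
    x = embed[tokens]                                   # shape (n_samples, d)

    for _ in range(n_layers):
        w1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
        w2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)
        x = np.maximum(x @ w1, 0.0) @ w2                # ReLU MLP, no residual

    preds = (x @ unembed.T).argmax(axis=1)              # predicted token per input
    counts = np.bincount(preds, minlength=vocab)
    return counts.max() / n_samples, 1.0 / vocab

if __name__ == "__main__":
    freq, uniform = top_token_frequency()
    print(f"top-token frequency: {freq:.3f}  (uniform baseline: {uniform:.3f})")
```

Changing `seed` should change which token is favored, which is the sense in which the preference is seed-dependent in this toy.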

Paper Structure

This paper contains 61 sections, 3 theorems, 38 equations, 14 figures, 13 tables, 1 algorithm.

Key Result

Proposition 2

Let $\bm X_1$ and $\bm X_2$ be two independent Gaussian vectors with mean zero and covariance $\sigma^2\bm{I}_d$. Denote by $\bm Z_l^{\mathrm{ReLU}}(\bm X)=\mathrm{MLP}_0^{(l)}\circ\cdots\circ\mathrm{MLP}_0^{(1)}(\bm X)$ the output after $l$ layers of independent MLP mappings with ReLU activation. [...] In contrast, if ReLU is substituted by tanh, $\mathbb{E}(\bar{\rho}_l^{\mathrm{tanh}})=0$ for any [...]
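
The contrast between ReLU and tanh can be checked numerically with a short NumPy sketch. The definition of $\mathrm{MLP}_0$ is not reproduced in this summary, so the code assumes a simplified block `activation(x @ w1) @ w2` with independent Gaussian weights and no bias, LayerNorm, or residual connection. The function names, the width multiplier, and the choice of mean pairwise cosine similarity as the reported statistic are illustrative assumptions.

```python
import numpy as np

def random_mlp_stack(x, n_layers, activation, rng, hidden_mult=4):
    """Apply n_layers independent random MLPs to the rows of x.

    Assumption: each block is activation(x @ w1) @ w2 with Gaussian weights,
    a stand-in for the paper's MLP_0 block, whose definition is not shown here.
    """
    d = x.shape[1]
    h = hidden_mult * d
    for _ in range(n_layers):
        w1 = rng.normal(size=(d, h)) / np.sqrt(d)
        w2 = rng.normal(size=(h, d)) / np.sqrt(h)
        x = activation(x @ w1) @ w2
    return x

def mean_pairwise_cosine(z):
    """Mean cosine similarity over all distinct pairs of rows of z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    gram = z @ z.T
    n = z.shape[0]
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))

def compare_activations(d=64, n_inputs=32, n_layers=8, seed=0):
    """Return (ReLU, tanh) mean pairwise cosine for independent Gaussian inputs."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_inputs, d))
    relu_out = random_mlp_stack(x, n_layers, lambda a: np.maximum(a, 0.0), rng)
    tanh_out = random_mlp_stack(x, n_layers, np.tanh, rng)
    return mean_pairwise_cosine(relu_out), mean_pairwise_cosine(tanh_out)

if __name__ == "__main__":
    relu_cos, tanh_cos = compare_activations()
    print(f"ReLU: {relu_cos:.3f}   tanh: {tanh_cos:.3f}")
```

Because tanh is odd, flipping the sign of one input flips the sign of its output in this sketch, so the similarity has no preferred sign; ReLU breaks that symmetry and pushes independent inputs toward a shared direction.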

Figures (14)

  • Figure 1: Initialized models are not blank slates. (a) When conducting next-token prediction on random sequences, a randomly initialized transformer exhibits extreme biases, with certain tokens preferred by orders of magnitude over others. For reference, the red dashed line indicates the empirical top-ranked frequencies observed under uniform random sampling. (b) The token representations from random transformers are severely contracted towards a common direction, as indicated by the pairwise cosine similarity of the last-token representation among sequences.
  • Figure 2: Pairwise cosine similarity of last-token representations between different sequences and its evolution with increasing transformer blocks.
  • Figure 3: Next-token preference induced by self-attention and MLP modules separately and combined. The self-attention-only model (orange) is flat, aligning with the empirical random baseline, while the MLP-only (green) and full (blue) models show strong preference.
  • Figure 4: Average and standard deviation of the pairwise cosine similarity between the last-token representations of different sequences, measured after each successive MLP block.
  • Figure 5: Average pairwise cosine similarity analysis. We compare the asymmetric ReLU (blue) against the symmetric tanh (red) in a simplified setting without LayerNorm or residuals to isolate the effect of activation symmetry. The full ReLU-based MLP block with residual connections and LayerNorm (green) is included for reference, demonstrating that the contraction phenomenon persists in the standard architecture but is moderated by the residual mechanism.
  • ...and 9 more figures

Theorems & Definitions (5)

  • Definition 1: MLP$_0$ block
  • Proposition 2
  • Definition 3: Attn$_0$ block
  • Proposition 4: self-attention as a contraction amplifier
  • Proposition 5
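
One way to picture self-attention as a contraction amplifier is the following NumPy sketch. It rests on assumptions that are not taken from the paper: attention weights are treated as exactly uniform over the causal prefix (a common idealization of attention at random initialization), and every token vector is modeled as a small shared component `mu` plus independent unit-variance noise. The names `cosine_by_position` and `shared` are hypothetical. Under these assumptions, position $t$ averages $t$ vectors, so the noise variance shrinks like $1/t$ while the shared component survives.

```python
import numpy as np

def mean_cross_sequence_cosine(h):
    """Mean cosine similarity over all distinct pairs of rows of h."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    gram = h @ h.T
    n = h.shape[0]
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))

def cosine_by_position(seq_len=16, d=64, n_seqs=200, shared=0.3, seed=0):
    """Cross-sequence cosine similarity at each position after uniform causal attention.

    Assumption: tokens = shared direction + i.i.d. Gaussian noise, and attention
    is a plain running mean over the prefix. Both are illustrative idealizations.
    """
    rng = np.random.default_rng(seed)
    mu = shared * rng.normal(size=d)                       # small shared component
    tokens = mu + rng.normal(size=(n_seqs, seq_len, d))    # (sequence, position, dim)

    prefix_len = np.arange(1, seq_len + 1)[None, :, None]
    attended = np.cumsum(tokens, axis=1) / prefix_len      # running mean over prefix

    return np.array([
        mean_cross_sequence_cosine(attended[:, t]) for t in range(seq_len)
    ])

if __name__ == "__main__":
    cos = cosine_by_position()
    print(f"first position: {cos[0]:.3f}   last position: {cos[-1]:.3f}")
```

In this toy, the first position attends only to itself and gets no amplification, whereas later positions are increasingly aligned across sequences. With `shared=0.0` the averaging has nothing to amplify.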