Transformer Normalisation Layers and the Independence of Semantic Subspaces

Stephen Menary, Samuel Kaski, Andre Freitas

TL;DR

This work investigates how Pre-Norm, the normalisation placement common in Transformers, can disrupt the independence of semantic subspaces: the latent components that determine attention distributions and support circuit-like reasoning. It introduces a formal notion of semantic subspaces and contrasts three norm placements: No-Norm, Pre-Norm (normalising the input to attention), and QKV-Norm (normalising after the attention head's linear operators). Theoretical results show that Pre-Norm requires subspaces to form orthogonal spheres to remain separable, whereas QKV-Norm only requires linear independence. Empirically, about 90% of the spread of embedding $L_2$-norms in Pre-Norm models falls within roughly ±20%, and artificially perturbing the norms by $\lesssim$10% produces a circuit-collapse rate of about 1%. This suggests a potential stability advantage for QKV-Norm in sparse-attention regimes, although QKV-Norm shows comparable in-distribution but worse out-of-distribution performance. The findings highlight how normalisation geometry shapes latent subspace structure, with implications for interpretability and model design in context-rich tasks.

Abstract

Recent works have shown that transformers can solve contextual reasoning tasks by internally executing computational graphs called circuits. Circuits often use attention to logically match information from subspaces of the representation, e.g. using position-in-sequence to identify the previous token. In this work, we consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution. We show that Pre-Norm, the placement of normalisation layer used by state-of-the-art transformers, violates this ability unless the model learns a strict representation structure of orthogonal spheres. This is because it causes linear subspaces to interfere through their common normalisation factor. Theoretically, we analyse circuit stability by modelling this interference as random noise on the $L_2$-norms of the query/key/value vectors, predicting a phenomenon of circuit collapse when sparse-attention shifts to a different token. Empirically, we investigate the sensitivity of real-world models trained for mathematical addition, observing a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim$10%. We contrast Pre-Norm with QKV-Norm, which places normalisation after the attention head's linear operators. Theoretically this relaxes the representational constraints. Empirically we observe comparable in-distribution but worse out-of-distribution performance.
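The interference described in the abstract can be illustrated with a small NumPy sketch. It is a toy setup, not the authors' code: the two subspaces `a` and `b`, their sizes, the random weights, and the helper names (`l2_normalise`, `logits`, `logit_shift`) are all hypothetical, and plain $L_2$ normalisation without learned gains stands in for the normalisation layer. The head reads only subspace `a`, so its logits should not change when only `b` changes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b = 4, 4  # sizes of two hypothetical semantic subspaces (assumption)


def l2_normalise(v):
    # Simplified stand-in for a normalisation layer: divide by the L2 norm.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# The head attends to subspace `a` only: rows acting on `b` are zero.
W_Q = np.vstack([rng.normal(size=(d_a, d_a)), np.zeros((d_b, d_a))])
W_K = np.vstack([rng.normal(size=(d_a, d_a)), np.zeros((d_b, d_a))])


def logits(x_q, x_k, mode):
    """Attention logits for one query and several keys under a norm placement."""
    if mode == "pre_norm":  # normalise the full embedding before W_Q / W_K
        x_q, x_k = l2_normalise(x_q), l2_normalise(x_k)
    q, k = x_q @ W_Q, x_k @ W_K
    if mode == "qkv_norm":  # normalise after the linear operators
        q, k = l2_normalise(q), l2_normalise(k)
    return k @ q


def build(a, b):
    # Embedding = concatenation of the two subspace components.
    return np.concatenate([a, b], axis=-1)


# Fix the attended component `a`; vary only the unattended component `b`.
a_q, a_k = rng.normal(size=d_a), rng.normal(size=(3, d_a))
b1_q, b1_k = rng.normal(size=d_b), rng.normal(size=(3, d_b))
b2_q, b2_k = 5 * rng.normal(size=d_b), 5 * rng.normal(size=(3, d_b))


def logit_shift(mode):
    """Largest change in any logit when only subspace `b` changes."""
    l1 = logits(build(a_q, b1_q), build(a_k, b1_k), mode)
    l2 = logits(build(a_q, b2_q), build(a_k, b2_k), mode)
    return float(np.max(np.abs(l1 - l2)))


for mode in ("no_norm", "pre_norm", "qkv_norm"):
    print(mode, logit_shift(mode))
```

In this sketch the No-Norm and QKV-Norm logits are unaffected by `b`, while under Pre-Norm each key's logit is rescaled by a factor that depends on the norm of the whole embedding, including `b`. Because that factor differs per key, the resulting attention distribution can change even though the attended subspace did not.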

Paper Structure (32 sections, 12 theorems, 66 equations, 14 figures, 4 tables)

Key Result

Theorem 1

No-Norm: If two heads with finite non-zero temperature attend to different semantic subspaces, the subspaces must be linearly independent $\mathbb{S}^{N_\alpha}_\alpha \equiv \mathbb{R}^{N_\alpha}$. Corollary: $W_{QK}$ is a low-rank matrix with (left and right) null-spaces that span all non-attended

Figures (14)

  • Figure 1: Spread of embedding $L_2$-norms experienced by attention heads at increasing model depth, excluding the [ token. For Pre-Norm, 90% of the spread is observed within an interval of $\pm20\%$. A supplementary figure shows the distributions used to make this plot, and further supplementary figures replicate the analysis for two model variations.
  • Figure 2: Left: evolution of per-token accuracy as we increase noise on the $L_2$-norms of $\{q,k_t,m_t\}$. A $\gtrsim 10\%$ drop in performance is observed when $1\%$ noise is applied to all layers. Right: applying noise only to $\{q,k_t\}$, we see that non-sparse attention drives the drop at small noise, whereas the sparse case is stable. This is consistent with the stability theorems for sparse and isotropic attention, but this interpretation is confounded by the relative importance of non-sparse distributions caused by frequency and depth-dependence.
  • Figure 4: Baseline Pre-Norm model predictions after 1 training epoch.
  • Figure 8: Model training curves for the Baseline Pre-Norm configuration.
  • Figure 9: Distribution of embedding $L_2$-norms at different model depths using the Baseline Pre-Norm model.
  • ...and 9 more figures
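
The norm-noise idea behind the Figure 2 caption can be sketched on synthetic data. This is a toy illustration under stated assumptions, not the authors' experiment: the function name `collapse_rate`, the Gaussian multiplicative noise model, and the random query and keys are all hypothetical. It counts how often the most-attended token changes when the $L_2$-norms of the query and keys are perturbed.

```python
import numpy as np


def collapse_rate(q, K, sigma, n_trials=2000, seed=0):
    """Fraction of trials in which norm noise changes the most-attended token.

    q: query vector, shape (d,). K: key vectors, shape (T, d).
    sigma: relative standard deviation of the multiplicative norm noise (assumed Gaussian).
    """
    rng = np.random.default_rng(seed)
    clean_top = int(np.argmax(K @ q))
    flips = 0
    for _ in range(n_trials):
        # Independent rescaling of each key norm changes the logits relative to each other.
        k_scale = 1.0 + sigma * rng.standard_normal(K.shape[0])
        # Rescaling the query norm multiplies every logit by the same factor.
        q_scale = 1.0 + sigma * rng.standard_normal()
        noisy_logits = (K * k_scale[:, None]) @ (q * q_scale)
        flips += int(np.argmax(noisy_logits) != clean_top)
    return flips / n_trials


# Synthetic query and keys (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
q = rng.normal(size=8)
K = rng.normal(size=(6, 8))

for sigma in (0.0, 0.01, 0.1):
    print(sigma, collapse_rate(q, K, sigma))
```

With `sigma = 0` the scales are exactly 1, so the top token never changes. For non-zero `sigma` the rate depends on how close the top two logits are, which is why the numbers from this toy setup should not be compared with the rates reported in the paper.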

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem 8
  • Theorem 9
  • Theorem 10
  • ...and 2 more