On the Geometric Structure of Layer Updates in Deep Language Models

Jun-Sik Yoo

Abstract

We study the geometric structure of layer updates in deep language models. Rather than analyzing what information is encoded in intermediate representations, we ask how representations change from one layer to the next. We show that layerwise updates admit a decomposition into a dominant tokenwise component and a residual that is not captured by restricted tokenwise function classes. Across multiple architectures, including Transformers and state-space models, we find that the full layer update is almost perfectly aligned with the tokenwise component, while the residual exhibits substantially weaker alignment, larger angular deviation, and significantly lower projection onto the dominant tokenwise subspace. This indicates that the residual is not merely a small correction, but a geometrically distinct component of the transformation. This geometric separation has functional consequences: approximation error under the restricted tokenwise model is strongly associated with output perturbation, with Spearman correlations often exceeding 0.7 and reaching up to 0.95 in larger models. Together, these results suggest that most layerwise updates behave like structured reparameterizations along a dominant direction, while functionally significant computation is concentrated in a geometrically distinct residual component. Our framework provides a simple, architecture-agnostic method for probing the geometric and functional structure of layer updates in modern language models.
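
To make the decomposition concrete, below is a minimal sketch in NumPy/SciPy. It assumes that hidden states are cached as (n_tokens, d) matrices per layer and that the restricted tokenwise class is a single linear map fitted by least squares; the paper's exact function class, notation, and intervention protocol are not reproduced on this page, so every name here (tokenwise_decomposition, residual_output_correlation, output_dev) is illustrative rather than the authors' implementation.

```python
# Minimal sketch of the layer-update decomposition described in the abstract.
# ASSUMPTION: hidden states are cached as (n_tokens, d) matrices, and the
# "restricted tokenwise function class" is one least-squares linear map.
import numpy as np
from scipy.stats import spearmanr

def tokenwise_decomposition(H_l, H_next):
    """Split the layer update H_next - H_l into a fitted tokenwise component
    and a residual that no single tokenwise linear map captures."""
    delta = H_next - H_l                              # full layer update
    W, *_ = np.linalg.lstsq(H_l, delta, rcond=None)   # tokenwise linear fit
    tokenwise = H_l @ W                               # dominant tokenwise part
    return tokenwise, delta - tokenwise               # (tokenwise, residual)

def cosine_alignment(A, B, eps=1e-12):
    """Per-token cosine similarity between the rows of A and B."""
    return np.sum(A * B, axis=1) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + eps)

def residual_output_correlation(residual, output_dev):
    """Spearman rank correlation between per-token residual error and a
    per-token output-perturbation measure obtained from an intervention
    (e.g., logit change when the tokenwise approximation is substituted)."""
    rho, _ = spearmanr(np.linalg.norm(residual, axis=1), output_dev)
    return rho

# Demo on random stand-in activations; real use would read cached model states.
rng = np.random.default_rng(0)
H_l = rng.standard_normal((512, 64))
H_next = H_l + 0.1 * rng.standard_normal((512, 64))

tok, res = tokenwise_decomposition(H_l, H_next)
delta = H_next - H_l
print("cos(update, tokenwise):", cosine_alignment(delta, tok).mean())
print("cos(update, residual): ", cosine_alignment(delta, res).mean())
```

On real activations, the abstract's finding would correspond to cosine_alignment(delta, tok) sitting near 1 while cosine_alignment(delta, res) is much weaker, and residual_output_correlation returning the reported Spearman values (often above 0.7) when output_dev comes from an actual intervention.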

Paper Structure

This paper contains 49 sections, 14 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Illustration of the decomposition into a local linear approximation and a residual term. The tokenwise prediction captures the dominant update direction, while the residual reflects deviations from this direction that arise from non-local structure.
  • Figure 2: Residual-output relationship across models and layers. (a) Across architectures, residual error strongly correlates with output deviation. (b) Representative token-level scatter (Pythia-1B). (c) Layer-wise variation of residual-output alignment. (d) Residual magnitude varies across layers, revealing structured regimes.
  • Figure 3: Geometric structure of layer updates. The full update is strongly aligned with the tokenwise approximation (left), while the residual exhibits large angular deviation (right), indicating a geometrically distinct component.
  • Figure 4: Projection onto dominant tokenwise subspace. The full and tokenwise updates lie almost entirely within a low-dimensional subspace, while the residual exhibits significantly lower projection, confirming its geometric separation. Results shown for top-1, top-4, and top-8 singular vectors (see the measurement sketch after this list).
  • Figure 5: Sensitivity to locality and rank in Pythia-70M. Left: neighborhood size sweep. Right: rank sweep.
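
The subspace measurement summarized in Figure 4 can be sketched the same way. The assumption below, which the text on this page does not confirm, is that the "dominant tokenwise subspace" is spanned by the top-k right singular vectors of the fitted tokenwise update; the projection fraction is then compared across the full update, the tokenwise component, and the residual.

```python
# Hedged sketch of the Figure 4 measurement. ASSUMPTION: the "dominant
# tokenwise subspace" is the span of the top-k right singular vectors of the
# fitted tokenwise update; the paper's exact definition may differ.
import numpy as np

def projection_fraction(X, basis):
    """Fraction of the Frobenius energy of X lying in span(basis columns)."""
    proj = X @ basis @ basis.T
    return float(np.linalg.norm(proj) ** 2 / (np.linalg.norm(X) ** 2 + 1e-12))

# Stand-in decomposition (see the sketch following the abstract).
rng = np.random.default_rng(0)
H_l = rng.standard_normal((512, 64))
delta = 0.1 * rng.standard_normal((512, 64))          # full layer update
W, *_ = np.linalg.lstsq(H_l, delta, rcond=None)       # tokenwise linear fit
tok = H_l @ W
res = delta - tok

# Basis of the dominant tokenwise subspace: top-k right singular vectors.
_, _, Vt = np.linalg.svd(tok, full_matrices=False)
for k in (1, 4, 8):
    V = Vt[:k].T                                      # (d, k) orthonormal basis
    print(f"k={k}: full={projection_fraction(delta, V):.3f}  "
          f"tokenwise={projection_fraction(tok, V):.3f}  "
          f"residual={projection_fraction(res, V):.3f}")
```

On real activations, the pattern reported in Figure 4 would appear as high projection fractions for the full and tokenwise updates and a markedly lower fraction for the residual at each k.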