Transformer Block Coupling and its Correlation with Generalization in LLMs

Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan

TL;DR

We study how token embeddings propagate through transformer blocks by analyzing the Jacobians of the block mappings and their top singular vectors, which reveals transformer block coupling across depth and across tokens. We define coupling metrics, including cross-depth and cross-token alignment, and quantify the linearity of token trajectories via the line-shape score and their exponential spacing via expodistance, validating these metrics on 30+ LLMs and ViTs. We find that stronger coupling correlates with better generalization (e.g., $R^2 = 0.8$, $p = 9.99\times 10^{-10}$) and that coupling emerges during training, accompanied by increasing linearity and layer-local coupling; ViTs show similar patterns. This coupling-based perspective offers a new lens on transformer mechanics, with potential to guide training and model design for improved generalization.

Abstract

Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we analyze the trajectories of token embeddings as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. By examining the relationships between these block Jacobians, we uncover the phenomenon of transformer block coupling in a multitude of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling positively correlates with model performance, and that this relationship is stronger than with other hyperparameters such as parameter count, model depth, and embedding dimension. We further investigate how these properties emerge during training, observing a progressive development of coupling, increased linearity, and layer-wise exponential growth in token trajectories. Additionally, experiments with Vision Transformers (ViTs) corroborate the emergence of coupling and its relationship with generalization, reinforcing our findings in LLMs. Collectively, these insights offer a novel perspective on token interactions in transformers, opening new directions for studying their mechanisms as well as improving training and generalization.
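
To make the idea of coupled singular vectors concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that coupling between two Jacobians can be probed by expressing one Jacobian in the top-$K$ singular bases of the other, $A = U_K^\top J_a V_K$, and checking how close $A$ is to diagonal. The helper names `coupling_matrix` and `off_diagonal_ratio`, the synthetic matrices, and the off-diagonal ratio itself are illustrative assumptions; the paper's exact coupling metric and normalization may differ.

```python
import numpy as np

def coupling_matrix(J_a, J_b, K):
    """Express J_a in the top-K singular bases of J_b: A = U_K^T J_a V_K.
    (Hypothetical helper; the paper's exact definition may differ.)"""
    U, _, Vt = np.linalg.svd(J_b)
    return U[:, :K].T @ J_a @ Vt[:K, :].T

def off_diagonal_ratio(A):
    """Fraction of the Frobenius norm of A that lies off the diagonal.
    Values near 0 mean A is nearly diagonal (strong coupling in this sketch)."""
    off = A - np.diag(np.diag(A))
    return np.linalg.norm(off) / np.linalg.norm(A)

rng = np.random.default_rng(0)
d, K = 32, 8

# Synthetic stand-ins for block Jacobians (not real model data):
# two matrices that share singular vectors but have different singular values.
Q_left, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q_right, _ = np.linalg.qr(rng.standard_normal((d, d)))
J_coupled_a = Q_left @ np.diag(np.linspace(3.0, 1.0, d)) @ Q_right.T
J_coupled_b = Q_left @ np.diag(np.linspace(2.0, 1.0, d)) @ Q_right.T

# An unrelated random matrix, standing in for an uncoupled Jacobian.
J_random = rng.standard_normal((d, d))

coupled = off_diagonal_ratio(coupling_matrix(J_coupled_a, J_coupled_b, K))
uncoupled = off_diagonal_ratio(coupling_matrix(J_random, J_coupled_b, K))
print(f"shared singular vectors: {coupled:.2e}")
print(f"unrelated matrices:      {uncoupled:.2f}")
```

In this toy setting the matrices with shared singular vectors give a ratio that is numerically zero, while the unrelated pair leaves most of the norm off the diagonal. For a real model, the Jacobians would instead come from differentiating each block's output with respect to its input embeddings, for example with automatic differentiation.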

Paper Structure

This paper contains 32 sections, 28 equations, 33 figures, and 2 tables.

Figures (33)

  • Figure 1: Transformer Block Coupling Measurements. (a) Correlation between average coupling (taking $K=\frac{1}{10}d_\text{model}$) and benchmark scores across LLMs: higher coupling corresponds to improved performance, and a regression fit yields an $R^2$ value of $0.8$ with a p-value of $9.99\times 10^{-10}$. (b) The mean normalized coupling (with $K=10$) as a function of training checkpoints for Pythia 12B and 6.9B (Biderman et al., 2023), measured at steps $\{128, 256, 512, 1k, 2k, \ldots, 128k, 143k\}$. (c-e) Adjacency plots of the mean coupling scores between pairs of layers. Each node represents a layer, and edge weight and opacity indicate the strength of depth-wise normalized coupling. Visualizations are provided for checkpoints 1, 4k, and 143k of Pythia 12B.
  • Figure 2: Transformer Block Coupling. A visualization of the various types of transformer block coupling, with brief instructions on computing both the Jacobians $J$ and the coupling matrices $A$ (Section \ref{methods:coupling}). The coupling measurement quantifies the alignment and agreement between the interactions of embedding connections within the network. The colored subscripts in the sample matrices $A$ indicate the specific connections being compared.
  • Figure 3: Transformer Block Coupling across Depth. The figure shows Jacobian coupling across transformer blocks 9 to 16, using the prompt "What is the capital of France? The capital is" to trace the final token's trajectory. In trained models (bottom row), the diagonal pattern with minimal off-diagonal values indicates alignment of the Jacobians, where the top singular vectors of $J^{l'}$ diagonalize $J^l$. Untrained models do not exhibit such coupling. Further details are in Appendix \ref{appendix:plots} (Figure \ref{fig:paris_all_8to16_appendix}). Best viewed in color.
  • Figure 4: Transformer Block Coupling across Tokens (Self Coupling). The figure shows Jacobian coupling for the same input and output token across tokens, visualized using the absolute values of $A_{ll'}^{ttt't'}$ (with fixed layers $l, l'$). In trained models (bottom row), the strong diagonal and small off-diagonal values indicate coupling, while no such coupling is present at initialization (top row). Additional plots are in Appendix \ref{appendix:plots} (Figure \ref{fig:coupling_inoutsame_appendix}).
  • Figure 5: Regularity of Trajectories. The line-shape score (LSS) of embedding trajectories, as discussed in Section \ref{section:linearity}, computed on 1,200 prompts of the HuggingFace Open LLM Leaderboard (Section \ref{section:prompt_data}) for a variety of trained (black) and randomly initialized (blue) LLMs (Appendix \ref{appendix:suite}). Median values over all prompts are plotted, accompanied by uncertainty intervals depicting the interquartile range of the results for each model. Models are sorted by number of parameters.
  • ...and 28 more figures