Transformer Block Coupling and its Correlation with Generalization in LLMs
Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan
TL;DR
We study how token embeddings propagate through transformer blocks by analyzing the Jacobians of the block mappings and the alignment of their top singular vectors, which reveals transformer block coupling across depth and across tokens. We define coupling metrics, including cross-depth and cross-token alignment, and quantify the linearity of token trajectories via the line-shape score and their exponential spacing via expodistance, validating these metrics on 30+ LLMs and ViTs. Stronger coupling correlates with better generalization (e.g., $R^2 = 0.8$, $p = 9.99\times 10^{-10}$), and coupling emerges during training, accompanied by increasing linearity and layer-local coupling; ViTs show similar patterns. This coupling-based perspective offers a new lens on transformer mechanics, with the potential to guide training and model design for improved generalization.
Abstract
Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we analyze the trajectories of token embeddings as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. By examining the relationships between these block Jacobians, we uncover the phenomenon of \textbf{transformer block coupling} in a multitude of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling \textit{positively correlates} with model performance, and that this correlation is stronger than the correlation of performance with hyperparameters such as parameter count, model depth, and embedding dimension. We further investigate how these properties emerge during training, observing a progressive development of coupling, increased linearity, and layer-wise exponential growth in token trajectories. Additionally, experiments with Vision Transformers (ViTs) corroborate the emergence of coupling and its relationship with generalization, reinforcing our findings in LLMs. Collectively, these insights offer a novel perspective on token interactions in transformers, opening new directions for studying their mechanisms as well as improving training and generalization.
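
To make the idea of coupled top singular vectors concrete, the following is a minimal sketch on a toy problem, not the metric or code used in this work. It assumes a hypothetical block map $f(x) = W_2 \tanh(W_1 x)$ with an analytic Jacobian, and a hypothetical score (`alignment_score`) that measures how much of one Jacobian's energy lies on the diagonal after projecting onto the top-$k$ singular vectors of another. The names `block_jacobian` and `alignment_score`, the choice $k = 4$, and the omission of the skip connection are all illustrative assumptions.

```python
import numpy as np


def block_jacobian(W1, W2, x):
    """Jacobian of the toy map f(x) = W2 @ tanh(W1 @ x), evaluated at x.

    Hypothetical stand-in for a transformer block; the skip connection is
    deliberately left out of this sketch.
    """
    h = np.tanh(W1 @ x)
    # Chain rule: d tanh(u)/du = 1 - tanh(u)^2, applied elementwise.
    return W2 @ np.diag(1.0 - h ** 2) @ W1


def alignment_score(J_a, J_b, k=4):
    """Illustrative coupling score in [0, 1] (an assumption of this sketch).

    Projects J_a onto the top-k left and right singular vectors of J_b and
    returns the fraction of squared Frobenius norm that lies on the diagonal.
    The score is 1 when those singular vectors also diagonalise J_a.
    """
    U, _, Vt = np.linalg.svd(J_b)
    A = U[:, :k].T @ J_a @ Vt[:k].T
    return float(np.sum(np.diag(A) ** 2) / np.sum(A ** 2))


rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)

# Two unrelated toy blocks with independent random weights.
J_1 = block_jacobian(rng.normal(size=(d, d)), rng.normal(size=(d, d)), x)
J_2 = block_jacobian(rng.normal(size=(d, d)), rng.normal(size=(d, d)), x)

# A third Jacobian that shares J_1's singular vectors but has different
# singular values: a perfectly "coupled" partner of J_1 under this score.
U_1, s_1, Vt_1 = np.linalg.svd(J_1)
J_coupled = U_1 @ np.diag(s_1[::-1]) @ Vt_1

score_self = alignment_score(J_1, J_1)           # close to 1 by construction
score_coupled = alignment_score(J_coupled, J_1)  # close to 1: shared singular vectors
score_unrelated = alignment_score(J_2, J_1)      # noticeably lower for random weights

print(score_self, score_coupled, score_unrelated)
```

In this sketch a high score between Jacobians taken at different depths would correspond to cross-depth coupling, and a high score between Jacobians taken at different tokens to cross-token coupling; for a real model the Jacobians would come from automatic differentiation rather than a closed form.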
