Transformer Dynamics: A neuroscientific approach to interpretability of large language models

Jesseba Fernando, Grigori Guitchounts

TL;DR

This paper addresses the interpretability gap in large language models by modeling the transformer residual stream (RS) as a layer-wise dynamical system, borrowing dynamical-systems tools from neuroscience. Using Llama 3.1 8B on WikiText-2, it analyzes RS activations at the pre-Attn and pre-MLP positions across $L=32$ layers ($D=4096$), applying correlations, velocity, mutual information, phase-space portraits, and dimensionality reduction via a compressing autoencoder and PCA. It finds that RS activations become denser and more correlated across layers, while individual units exhibit rotational dynamics resembling unstable periodic orbits; in reduced space, RS trajectories form curved, attractor-like paths with self-correcting responses to perturbations, especially in lower layers. These findings support a “neuroscience of AI” framework that links dynamical-systems theory to mechanistic interpretability and could inform future architecture design and training strategies for more robust AI systems. The work also demonstrates the value of combining large-scale data analyses with dynamical insights to understand Transformer computations beyond static representations.

Abstract

As artificial intelligence models have exploded in scale and capability, understanding of their internal mechanisms remains a critical challenge. Inspired by the success of dynamical systems approaches in neuroscience, here we propose a novel framework for studying computations in deep learning systems. We focus on the residual stream (RS) in transformer models, conceptualizing it as a dynamical system evolving across layers. We find that activations of individual RS units exhibit strong continuity across layers, despite the RS being a non-privileged basis. Activations in the RS accelerate and grow denser over layers, while individual units trace unstable periodic orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers. These insights bridge dynamical systems theory and mechanistic interpretability, establishing a foundation for a "neuroscience of AI" that combines theoretical rigor with large-scale data analysis to advance our understanding of modern neural networks.

Paper Structure

This paper contains 16 sections, 26 equations, 4 figures.

Figures (4)

  • Figure 1: Transformer residual stream (RS) activations grow dense over the layers, are highly correlated among successive layers, and exhibit nonstationary dynamics. A: Activations of the transformer RS were captured before layernorm and the attention operation (pre-Attn) and before the MLP (pre-MLP) at each layer of Llama 3.1 8B, resulting in a $64 \times 4096$ ('layers' by 'units') matrix. Activations were analyzed at the last token position for data samples from the wikitext-2-raw-v1 dataset unless otherwise noted. B: Mean activations across $N=1000$ samples. C: Correlations of activations for unit $u$ between layer $l$ and $l+1$ over data samples. For most units, correlations between successive layers increase over the layers. D: Histogram of correlations across layers for each unit. Although the residual stream does not have a privileged basis, activations of most units are highly correlated from layer to layer. E: Cosine similarity between pairs of RS vectors $\mathbf{h}_{l}^{Attn} \rightarrow \mathbf{h}_{l}^{MLP}$ (green) and $\mathbf{h}_{l}^{MLP} \rightarrow \mathbf{h}_{l+1}^{Attn}$ (blue). F: Velocity $V$ of the RS vectors. G: Mutual information (MI) between pairs of activations for unit $u$ at layer $l$ and $l+1$ over data samples. H: MI over the layers, averaged across units in the RS.
  • Figure 2: Portraits of Individual RS Units Show Rotational Dynamics akin to Unstable Periodic Orbits. A: Portraits of individual units in activation-gradient space, where the gradient is taken over the 64 effective sublayers. B: Distribution of the estimated number of rotations each unit performs in this phase space compared to a control in which the layer order was shuffled 1000 times for each unit. The mean number of rotations over the layers is 10.74 for the RS units and $\sim 0$ for the shuffle controls. C: The number of rotations for each of the 4096 units in the RS and their shuffle controls.
  • Figure 3: Compressing Autoencoder (CAE) Shows Dynamics of the RS in Reduced Dimensional Space. A: The CAE was trained to pass RS vectors at individual pre-attention and pre-MLP sublayers through a bottleneck and reconstruct the original vector. Results are shown for a CAE trained with 10 layers that reduces the dimensionality at the bottleneck to 2. B: Mean trajectory across $n=1000$ test data samples. C: Distance in the reduced space between subsequent layers. D: Explained variance on the test set as a function of the layers.
  • Figure 4: Perturbation of RS Trajectories Reveals Self-correcting Dynamics. A: Individual trajectories of $n=1000$ data samples (black) and their mean (colored by layer) in PCA space. B: Cumulative explained variance of the trajectories as a function of the number of components. C: Explained variance per layer using 100 principal components. D: Perturbation analysis in which trajectories were 'teleported' to various points, at various stages in the RS (indicated by layer number above each subplot). Gray line shows unperturbed control trajectory. Quiver arrows indicate direction and magnitude of teleported trajectories based on the successive 12 sublayers after teleportation.
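
The layer-to-layer statistics described in Figure 1 (panels C and E) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the array layout `(n_samples, n_sublayers, n_units)`, and the random-walk data standing in for real Llama activations are all assumptions made for illustration.

```python
import numpy as np


def successive_layer_correlations(acts):
    """Pearson r, per unit, between sublayer l and l+1 across data samples.

    acts: array of shape (n_samples, n_sublayers, n_units) (assumed layout).
    Returns an array of shape (n_sublayers - 1, n_units).
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    a, b = centered[:, :-1, :], centered[:, 1:, :]
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return num / den


def successive_cosine_similarity(acts):
    """Cosine similarity between whole RS vectors at successive sublayers.

    Returns an array of shape (n_samples, n_sublayers - 1).
    """
    a, b = acts[:, :-1, :], acts[:, 1:, :]
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den


# Synthetic stand-in for RS activations (hypothetical data, not model output):
# a random walk over sublayers loosely mimics additive residual updates.
rng = np.random.default_rng(0)
n_samples, n_sublayers, n_units = 200, 8, 16
acts = np.cumsum(rng.normal(size=(n_samples, n_sublayers, n_units)), axis=1)

corr = successive_layer_correlations(acts)
cos = successive_cosine_similarity(acts)
print(corr.shape, cos.shape)
```

With real data, `acts` would hold the 64 pre-Attn and pre-MLP sublayers by 4096 units per sample; the small shapes here only keep the example fast.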