Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Jesseba Fernando, Grigori Guitchounts
TL;DR
This paper addresses the interpretability gap in large language models by modeling the transformer residual stream (RS) as a dynamical system that evolves layer by layer, borrowing dynamical-systems tools from neuroscience. Using Llama 3.1 8B on WikiText-2, it analyzes RS activations at the pre-Attn and pre-MLP positions across $L=32$ layers ($D=4096$), applying correlations, velocity, mutual information, phase-space portraits, and dimensionality reduction via a compressing autoencoder and PCA. It finds that RS units become denser and more correlated across layers, while individual units exhibit rotational dynamics resembling unstable periodic orbits. In the reduced space, RS trajectories follow curved, attractor-like paths and self-correct after perturbations, especially in the lower layers. The authors present these findings as a foundation for a “neuroscience of AI” that links dynamical-systems theory to mechanistic interpretability and could inform architecture design and training strategies for more robust AI systems. The work also illustrates how combining large-scale data analysis with dynamical-systems analysis can describe Transformer computations beyond static representations.
Abstract
As artificial intelligence models have exploded in scale and capability, understanding of their internal mechanisms remains a critical challenge. Inspired by the success of dynamical systems approaches in neuroscience, here we propose a novel framework for studying computations in deep learning systems. We focus on the residual stream (RS) in transformer models, conceptualizing it as a dynamical system evolving across layers. We find that activations of individual RS units exhibit strong continuity across layers, despite the RS being a non-privileged basis. Activations in the RS accelerate and grow denser over layers, while individual units trace unstable periodic orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers. These insights bridge dynamical systems theory and mechanistic interpretability, establishing a foundation for a "neuroscience of AI" that combines theoretical rigor with large-scale data analysis to advance our understanding of modern neural networks.
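To make the layer-wise dynamical view concrete, the following minimal sketch treats the RS as a trajectory indexed by layer and computes three of the quantities mentioned above: continuity of unit activations between successive layers, velocity, and a low-dimensional projection. It is an illustration under stated assumptions, not the authors' code: the activations are synthetic (a cumulative sum of random updates standing in for additive residual updates), the dimension is reduced from 4096 to 64, and the function names and the exact definitions of continuity (Pearson correlation across units) and velocity (norm of the layer-to-layer difference) are hypothetical choices made here.

```python
import numpy as np

def layer_continuity(rs):
    """Pearson correlation across units between each layer and the next.

    rs has shape (L, D): one activation vector per layer.
    Returns an array of shape (L - 1,).
    """
    a = rs[:-1] - rs[:-1].mean(axis=1, keepdims=True)
    b = rs[1:] - rs[1:].mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def layer_velocity(rs):
    """Euclidean norm of the change in the RS from one layer to the next."""
    return np.linalg.norm(np.diff(rs, axis=0), axis=1)

def pca_project(rs, k=2):
    """Project the layer trajectory onto its top-k principal components."""
    centered = rs - rs.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Synthetic stand-in for real activations (assumption): each layer adds a
# random update to the stream, so the RS is a cumulative sum over layers.
rng = np.random.default_rng(0)
L, D = 32, 64  # the paper reports L=32, D=4096; D is reduced here for speed
updates = rng.normal(size=(L, D))
rs = np.cumsum(updates, axis=0)

continuity = layer_continuity(rs)     # shape (L - 1,)
velocity = layer_velocity(rs)         # shape (L - 1,)
trajectory = pca_project(rs, k=2)     # shape (L, 2)
```

With real data, `rs` would be replaced by activations recorded for a single token at each layer (and, to match the paper's setup, at both the pre-Attn and pre-MLP positions), and the resulting curves could be averaged over many tokens.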
