Recurrent Self-Attention Dynamics: An Energy-Agnostic Perspective from Jacobians
Akiyoshi Tomihari, Ryo Karakida
TL;DR
This work introduces an energy-agnostic perspective on recurrent self-attention (SA) by relaxing traditional energy-based constraints and using dynamical-systems tools to study inference dynamics through Jacobians. It demonstrates that normalization layers, particularly RMSNorm, stabilize SA dynamics by suppressing the Jacobian's spectral norm and taming oscillatory eigenvalues, and that Lyapunov exponents near zero correlate with high performance. The authors also develop a Jacobian-based pseudo-energy for monitoring inference and show that regularizing the Jacobian spectrum can improve test-time scaling, especially in looped architectures like AKOrN. While energy-based regularization often underperforms in practice, spectral regularization provides a robust mechanism for guiding the dynamics toward a favorable, near-critical regime. The findings offer practical insights for designing and regularizing looped SA architectures and lay groundwork for further theory on the inference dynamics of realistic Transformers.
Abstract
The theoretical understanding of self-attention (SA) has been steadily progressing. A prominent line of work studies a class of SA layers that admit an energy function that decreases under state updates. While this line of work provides valuable insights into inherent biases in signal propagation, it often relies on idealized assumptions or additional constraints not necessarily present in standard SA. Thus, to broaden our understanding, this work aims to relax these energy constraints and provide an energy-agnostic characterization of inference dynamics through dynamical systems analysis. In more detail, we first consider relaxing the symmetry and single-head constraints traditionally required in energy-based formulations. Next, we show that analyzing the Jacobian matrix of the state is highly valuable when investigating more general SA architectures that do not necessarily admit an energy function. This analysis reveals that the normalization layer plays an essential role in suppressing both the Lipschitzness of SA and the Jacobian's complex eigenvalues, which correspond to the oscillatory components of the dynamics. In addition, the Lyapunov exponents computed from the Jacobians demonstrate that the normalized dynamics lie close to a critical state, and that this criticality serves as a strong indicator of high inference performance. Furthermore, the Jacobian perspective enables us to develop regularization methods for training and a pseudo-energy for monitoring inference dynamics.
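
To make the Jacobian-based quantities above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single-head recurrent update of the form X <- RMSNorm(X + softmax(Q K^T / sqrt(d)) V) with random, non-symmetric weights, and estimates the largest Lyapunov exponent using finite-difference Jacobian-vector products with per-step renormalization. The update rule, the function names (sa_step, max_lyapunov), and the toy sizes are all assumptions made for illustration.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def rmsnorm(row, eps=1e-8):
    """Rescale one token vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v / rms for v in row]


def sa_step(X, Wq, Wk, Wv):
    """Assumed update: X <- RMSNorm(X + softmax(Q K^T / sqrt(d)) V)."""
    d = len(X[0])
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    Kt = [list(col) for col in zip(*K)]
    A = [softmax([s / math.sqrt(d) for s in row]) for row in matmul(Q, Kt)]
    Y = [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(X, matmul(A, V))]
    return [rmsnorm(r) for r in Y]


def frob(M):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in M for v in row))


def max_lyapunov(X, step, n_steps=200, fd_eps=1e-6, seed=0):
    """Estimate the largest Lyapunov exponent of `step` along the orbit of X."""
    rng = random.Random(seed)
    v = [[rng.gauss(0.0, 1.0) for _ in row] for row in X]
    scale = frob(v)
    v = [[w / scale for w in row] for row in v]  # unit-norm tangent vector
    total = 0.0
    for _ in range(n_steps):
        FX = step(X)
        FXp = step([[x + fd_eps * w for x, w in zip(xr, vr)] for xr, vr in zip(X, v)])
        # Finite-difference approximation of the Jacobian-vector product J(X) v.
        Jv = [[(a - b) / fd_eps for a, b in zip(ar, br)] for ar, br in zip(FXp, FX)]
        growth = max(frob(Jv), 1e-300)  # guard against log(0)
        total += math.log(growth)
        v = [[w / growth for w in row] for row in Jv]  # renormalize each step
        X = FX
    return total / n_steps


# Toy setup (hypothetical sizes and untrained random weights).
rng = random.Random(0)
n_tokens, d = 4, 4


def rand_matrix(rows, cols, scale):
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


X0 = [rmsnorm(r) for r in rand_matrix(n_tokens, d, 1.0)]
Wq, Wk, Wv = (rand_matrix(d, d, 1.0 / math.sqrt(d)) for _ in range(3))
lam = max_lyapunov(X0, lambda X: sa_step(X, Wq, Wk, Wv))
print(f"estimated largest Lyapunov exponent: {lam:.4f}")
```

In this sketch, an estimate near zero would correspond to the near-critical regime described in the abstract, a clearly negative value to contracting dynamics, and a positive value to expanding dynamics. The weights here are untrained, so the printed number carries no information about the paper's results.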
