Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability
Kingsuk Maitra
TL;DR
Momentum Attention extends the Mechanistic Interpretability (MI) map of the Transformer with time-varying dynamics: a momentum term $p_t = q_t - q_{t-1}$ and a symplectic shear recast attention as a phase-space transformation that preserves volume in the sense of Liouville's theorem. The authors establish a Symplectic-Filter Duality, proving that the physical shear is equivalent to a High-Pass Filter, which enables Single-Layer Induction and Spectral Forensics through RoPE-based frequency analysis. An Orthogonality Theorem guarantees that DC (semantic) and AC (mechanistic) signals occupy orthogonal spectral bands when momentum is applied post-RoPE, a claim validated across 5,100+ experiments documented in 27 notebooks. A 125M Momentum model comes within about 2.9% of a 350M baseline's validation loss with roughly 64% fewer parameters, and a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ links momentum coupling to network depth, offering practical guidance for deploying physics-informed inductive mechanisms in transformers.
Abstract
The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution: by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A--R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within $\sim$2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.
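The shear and its filter reading can be illustrated with a minimal NumPy sketch; it is not the authors' implementation. It applies $\hat{q}_t = q_t + \gamma p_t$ along the sequence axis and evaluates the magnitude of the frequency response that follows from the definition, $H(\omega) = 1 + \gamma(1 - e^{-i\omega})$. The function names, the boundary convention $p_0 = 0$ (the abstract does not define $q_{-1}$), and the reading of $N$ as a layer count are assumptions made for illustration.

```python
import numpy as np


def momentum_shear(q, gamma):
    """Return q_hat_t = q_t + gamma * (q_t - q_{t-1}) for q of shape (T, d).

    Assumption: p_0 = 0, because q_{-1} is not defined in the text.
    """
    q = np.asarray(q, dtype=float)
    p = np.zeros_like(q)
    p[1:] = q[1:] - q[:-1]  # kinematic difference p_t = q_t - q_{t-1}
    return q + gamma * p


def shear_gain(omega, gamma):
    """Magnitude of H(omega) = 1 + gamma * (1 - exp(-i * omega))."""
    return np.abs(1.0 + gamma * (1.0 - np.exp(-1j * np.asarray(omega))))


def gamma_star(n):
    """Reported scaling law gamma* = 4.17 * N^(-0.74).

    Assumption: N is the network depth (number of layers).
    """
    return 4.17 * n ** -0.74


if __name__ == "__main__":
    gamma = 0.5
    omegas = np.linspace(0.0, np.pi, 5)
    # Gain rises monotonically from DC (omega = 0) to Nyquist (omega = pi).
    print(shear_gain(omegas, gamma))
```

In this sketch the difference term $p_t$ alone has zero gain at $\omega = 0$, while the full shear has unit gain at $\omega = 0$ and gain $1 + 2\gamma$ at $\omega = \pi$, so a constant (DC) sequence passes through unchanged and a sign-alternating sequence is amplified.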
