Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Kingsuk Maitra

TL;DR

Momentum Attention extends the MI Transformer map by embedding time-varying dynamics through a momentum term $p_t = q_t - q_{t-1}$ and a symplectic shear, recasting attention as a phase-space transformation that preserves volume via Liouville's theorem. The authors establish a Symplectic-Filter Duality, proving that the physical shear is equivalent to a High-Pass Filter and enabling Single-Layer Induction and Spectral Forensics through RoPE-based frequency analysis. An Orthogonality Theorem guarantees that DC semantic and AC mechanistic signals occupy orthogonal spectral bands when momentum is applied post-RoPE, a claim validated across $5{,}100+$ experiments and 27 notebooks. A 125M Momentum model matches a 350M baseline within about 2.9% validation loss using 64% fewer parameters, and a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ links momentum coupling to network depth, offering practical guidance for deploying physics-informed inductive mechanisms in transformers.

Abstract

The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution: by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A–R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within $\sim$2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.
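To make the shear concrete, the following is a minimal NumPy sketch of $\hat{q}_t = q_t + \gamma p_t$ applied to queries and keys ahead of ordinary causal softmax attention. It is an illustration under stated assumptions, not the authors' implementation: the function names `momentum_shear` and `momentum_attention` are hypothetical, the boundary condition $p_0 = 0$ is assumed, only a single head is shown, and the RoPE rotation that the paper applies before the momentum step is omitted.

```python
import numpy as np

def momentum_shear(x, gamma):
    """Return x_hat[t] = x[t] + gamma * (x[t] - x[t-1]) along axis 0 (sequence).

    Assumption: p_0 = 0, i.e. the first position has no predecessor
    and is left unchanged.
    """
    x = np.asarray(x, dtype=float)
    p = np.zeros_like(x)
    p[1:] = x[1:] - x[:-1]          # kinematic difference p_t = x_t - x_{t-1}
    return x + gamma * p

def momentum_attention(q, k, v, gamma):
    """Single-head causal softmax attention with sheared queries and keys.

    q, k, v have shape (T, d). RoPE is intentionally left out of this sketch.
    """
    q_hat = momentum_shear(q, gamma)
    k_hat = momentum_shear(k, gamma)
    T, d = q_hat.shape
    scores = q_hat @ k_hat.T / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 8))   # three (T=6, d=8) arrays
out = momentum_attention(q, k, v, gamma=0.5)
print(out.shape)
```

Setting `gamma=0` recovers plain causal attention, which gives a convenient baseline when comparing the two regimes.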

Paper Structure

This paper contains 18 sections, 6 theorems, 14 equations, 5 figures, 2 tables, 2 algorithms.

Key Result

Theorem 2.2

The kinematic difference operator $\mathcal{K}(q_t) = \gamma(q_t - q_{t-1})$ is the unique linear operator satisfying: (1) Causality, (2) High-Pass Condition, and (3) Symplectic Consistency.
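The high-pass condition can be checked with a standard discrete-time calculation. The derivation below is a sketch that assumes $\gamma \geq 0$ and treats token position as a discrete time index with a complex-exponential input $q_t = e^{i\omega t}$; the symbols $H_{\mathcal{K}}$ and $H$ are introduced here for illustration, and the paper's own normalization in Theorem 2.5 may differ.

```latex
% Momentum branch alone: K(q_t) = \gamma (q_t - q_{t-1})
H_{\mathcal{K}}(\omega) = \gamma \left(1 - e^{-i\omega}\right),
\qquad
\left|H_{\mathcal{K}}(\omega)\right| = 2\gamma \left|\sin\tfrac{\omega}{2}\right|

% Full shear: \hat{q}_t = q_t + \gamma (q_t - q_{t-1})
H(\omega) = 1 + \gamma \left(1 - e^{-i\omega}\right),
\qquad
\left|H(\omega)\right|^2 = 1 + 4\gamma(1+\gamma)\sin^2\tfrac{\omega}{2}

% Limits
\left|H_{\mathcal{K}}(0)\right| = 0, \quad
\left|H_{\mathcal{K}}(\pi)\right| = 2\gamma, \quad
\left|H(0)\right| = 1, \quad
\left|H(\pi)\right| = 1 + 2\gamma
```

Under these assumptions the momentum branch rejects DC entirely and its gain grows monotonically up to the Nyquist frequency $\omega = \pi$, while the full shear passes constant (DC) content with unit gain and amplifies rapidly varying content by up to $1 + 2\gamma$.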

Figures (5)

  • Figure 1: The Induction Circuit and Phase Transition. (A) Left: Standard two-layer induction head requires Layer 1 (Shift) to pass positional information to Layer 2 (Match), using purely DC signals. Right: Our single-layer Momentum Attention injects dynamic AC signals ($p_t = q_t - q_{t-1}$) alongside DC signals, enabling Shift+Match in one layer while unlocking Spectral Forensics. (B) Phase transition from Static Regime to Kinematic Regime at $\gamma_c \approx 0.225$. Standard transformers require $L \geq 2$ layers; Momentum Attention enables Single-Layer Induction. See Appendices B, D, E and Addendum to Appendix D.
  • Figure 2: The Orthogonality Theorem: The "Escape Route." (A) Standard "DC-Coupled" attention processes only semantic states; our "AC-Coupled" Momentum Attention captures both states (DC) and transitions (AC). The Spectral Escape Route emerges when signals occupy orthogonal frequency bands. (B) Empirical frequency response showing DC/AC orthogonality. The critical coupling $\gamma_c$ aligns with induction head emergence. See Appendices E, H.
  • Figure 3: Spectral Forensics: Bode Plot Autopsy. (Top) Kinematic Frame Theory: momentum must be applied Post-RoPE to avoid "Coriolis Error." (Bottom Left) Pre-RoPE: Frame mismatch destroys spectral signal ($r = 0.12$, $-4.1\%$ regression). (Bottom Right) Post-RoPE: Clean high-pass signature ($r = 0.94$, $+52.5\%$ gain). The asymmetry between these outcomes (gain vs. regression, not merely gain vs. parity) is a direct consequence of the spectral complementarity guaranteed by the symplectic conservation law (see the section on spectral complementarity). See Appendices F, P.
  • Figure 4: ICL Stress Test. (A) Signal Decay: Standard (red) vs Momentum (blue) across chain depths ($L = 30$). (B) Theoretical Retention: exponential decay ($p^L$) vs linear decay ($1 - cL$). (C) Complexity Scaling: $+52.5\%$ gain from $L = 10$ to $L = 30$. See Appendices N, O.
  • Figure 5: The Validated Physics of Symplectic Attention (Experiments 16 & 18). (A) Single-Layer Induction: Breaking the $N \geq 2$ Barrier. The standard transformer ($\gamma = 0$, red dashed) achieves only random chance (1.2%), while the momentum transformer (green) reaches 83.4% peak accuracy at $\gamma = 4.0$. The phase transition at $\gamma \approx 1.0$ and saturation regime ($\gamma > 4.0$, reflecting position-momentum uncertainty) are clearly visible. (B) The Attenuated Scaling Law: $\gamma^* = 4.17 \times N^{-0.74}$. Sub-linear exponent ($\alpha < 1$) implies signal attenuation across layers, validating the theoretical prediction that momentum and depth are fungible computational resources. See Addendum to Appendix D for complete experimental details across 270+ configurations.
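The scaling law in Figure 5(B) can be evaluated directly. The snippet below is a small sketch that reads $N$ as the number of layers (as the caption's "$N \geq 2$ barrier" suggests) and uses the reported constants 4.17 and 0.74; the function name `optimal_gamma` is hypothetical, and the fit is only supported by the depths the authors actually tested.

```python
def optimal_gamma(n_layers, a=4.17, alpha=0.74):
    """Evaluate the reported fit gamma* = a * N**(-alpha).

    Assumption: N is the layer count; a and alpha are the constants
    quoted in the Figure 5 caption.
    """
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    return a * n_layers ** (-alpha)

# Tabulate the suggested coupling for a few illustrative depths.
for n in (1, 2, 4, 8, 12):
    print(f"N={n:2d}  gamma* ~ {optimal_gamma(n):.3f}")
```

At $N = 1$ the fit returns the prefactor 4.17, close to the single-layer peak at $\gamma = 4.0$ reported in panel (A), and the suggested coupling falls as depth grows.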

Theorems & Definitions (11)

  • Definition 2.1: Kinematic Momentum Operator
  • Theorem 2.2: Uniqueness of the Momentum Operator
  • proof
  • Theorem 2.3: Preservation of Symplectic Form
  • proof
  • Theorem 2.4: Single-Layer Induction Capability
  • proof
  • Theorem 2.5: Velocity Transfer Function
  • proof
  • Theorem 2.6: Orthogonality of Semantic and Mechanistic Signals
  • ...and 1 more