Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Kingsuk Maitra

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Kingsuk Maitra

TL;DR

Momentum Attention extends the MI Transformer map by embedding time-varying dynamics through a momentum term $p_t = q_t - q_{t-1}$ and a symplectic shear, recasting attention as a phase-space transformation that preserves volume via Liouville's theorem. The authors establish a Symplectic-Filter Duality, proving that the physical shear is equivalent to a High-Pass Filter and enabling Single-Layer Induction and Spectral Forensics through RoPE-based frequency analysis. An Orthogonality Theorem guarantees that DC semantic and AC mechanistic signals occupy orthogonal spectral bands when momentum is applied post-RoPE, a claim validated across $5{,}100+$ experiments and 27 notebooks. A 125M Momentum model matches a 350M baseline within about 2.9% validation loss using 64% fewer parameters, and a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ links momentum coupling to network depth, offering practical guidance for deploying physics-informed inductive mechanisms in transformers.

Abstract

The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + γp_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution -- by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A--R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within $\sim$2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $γ^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

TL;DR

Momentum Attention extends the MI Transformer map by embedding time-varying dynamics through a momentum term

and a symplectic shear, recasting attention as a phase-space transformation that preserves volume via Liouville's theorem. The authors establish a Symplectic-Filter Duality, proving that the physical shear is equivalent to a High-Pass Filter and enabling Single-Layer Induction and Spectral Forensics through RoPE-based frequency analysis. An Orthogonality Theorem guarantees that DC semantic and AC mechanistic signals occupy orthogonal spectral bands when momentum is applied post-RoPE, a claim validated across

experiments and 27 notebooks. A 125M Momentum model matches a 350M baseline within about 2.9% validation loss using 64% fewer parameters, and a scaling law

links momentum coupling to network depth, offering practical guidance for deploying physics-informed inductive mechanisms in transformers.

Abstract

, implementing the symplectic shear

on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution -- by injecting kinematic momentum, we sidestep the topological depth constraint (

) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A--R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within

2.9% validation loss. Dedicated associative recall experiments reveal a scaling law

establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.

Paper Structure (18 sections, 6 theorems, 14 equations, 5 figures, 2 tables, 2 algorithms)

This paper contains 18 sections, 6 theorems, 14 equations, 5 figures, 2 tables, 2 algorithms.

Introduction
Momentum Attention as a Symplectic Shear
Phase Space Formulation and Uniqueness
The Symplectic Shear Transformation
The Hamiltonian Shortcut: Single-Layer Induction
The Expanded Attention Expression
The Transfer Function and Filter Dynamics
The Orthogonality Theorem (The Escape Route)
The Placement Corollary (Post-RoPE Validity)
Spectral Complementarity: The Conservation Law Consequence
Empirical Validation
Spectral Forensics: Theory Meets Experiment
Task Dissociation: The High-Pass Signature
Efficiency at Scale: David vs. Goliath
ICL Stress Test: Probing the Limits
...and 3 more sections

Key Result

Theorem 2.2

The kinematic difference operator $\mathcal{K}(q_t) = \gamma(q_t - q_{t-1})$ is the unique linear operator satisfying: (1) Causality, (2) High-Pass Condition, and (3) Symplectic Consistency.

Figures (5)

Figure 1: The Induction Circuit and Phase Transition.(A)Left: Standard two-layer induction head requires Layer 1 (Shift) to pass positional information to Layer 2 (Match), using purely DC signals. Right: Our single-layer Momentum Attention injects dynamic AC signals ($p_t = q_t - q_{t-1}$) alongside DC signals, enabling Shift+Match in one layer while unlocking Spectral Forensics. (B) Phase transition from Static Regime to Kinematic Regime at $\gamma_c \approx 0.225$. Standard transformers require $L \geq 2$ layers; Momentum Attention enables Single-Layer Induction. See Appendices B, D, E and Addendum to Appendix D.
Figure 2: The Orthogonality Theorem: The "Escape Route."(A) Standard "DC-Coupled" attention processes only semantic states; our "AC-Coupled" Momentum Attention captures both states (DC) and transitions (AC). The Spectral Escape Route emerges when signals occupy orthogonal frequency bands. (B) Empirical frequency response showing DC/AC orthogonality. The critical coupling $\gamma_c$ aligns with induction head emergence. See Appendices E, H.
Figure 3: Spectral Forensics: Bode Plot Autopsy.(Top) Kinematic Frame Theory: momentum must be applied Post-RoPE to avoid "Coriolis Error." (Bottom Left) Pre-RoPE: Frame mismatch destroys spectral signal ($r = 0.12$, $-4.1\%$ regression). (Bottom Right) Post-RoPE: Clean high-pass signature ($r = 0.94$, $+52.5\%$ gain). The asymmetry between these outcomes---gain vs. regression, not merely gain vs. parity---is a direct consequence of the spectral complementarity guaranteed by the symplectic conservation law (Section \ref{['sec:complementarity']}). See Appendices F, P.
Figure 4: ICL Stress Test.(A) Signal Decay: Standard (red) vs Momentum (blue) across chain depths ($L = 30$). (B) Theoretical Retention: exponential decay ($p^L$) vs linear decay ($1 - cL$). (C) Complexity Scaling: $+52.5\%$ gain from $L = 10$ to $L = 30$. See Appendices N, O.
Figure 5: The Validated Physics of Symplectic Attention (Experiments 16 & 18).(A) Single-Layer Induction: Breaking the $N \geq 2$ Barrier. The standard transformer ($\gamma = 0$, red dashed) achieves only random chance (1.2%), while the momentum transformer (green) reaches 83.4% peak accuracy at $\gamma = 4.0$. The phase transition at $\gamma \approx 1.0$ and saturation regime ($\gamma > 4.0$, reflecting position-momentum uncertainty) are clearly visible. (B) The Attenuated Scaling Law: $\gamma^* = 4.17 \times N^{-0.74}$. Sub-linear exponent ($\alpha < 1$) implies signal attenuation across layers, validating the theoretical prediction that momentum and depth are fungible computational resources. See Addendum to Appendix D for complete experimental details across 270+ configurations.

Theorems & Definitions (11)

Definition 2.1: Kinematic Momentum Operator
Theorem 2.2: Uniqueness of the Momentum Operator
proof
Theorem 2.3: Preservation of Symplectic Form
proof
Theorem 2.4: Single-Layer Induction Capability
proof
Theorem 2.5: Velocity Transfer Function
proof
Theorem 2.6: Orthogonality of Semantic and Mechanistic Signals
...and 1 more

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

TL;DR

Abstract

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (5)

Theorems & Definitions (11)