Coupled Query-Key Dynamics for Attention

Barak Gahtan, Alex M. Bronstein

Abstract

Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys \emph{jointly} through shared learned dynamics before scoring - which we call \textbf{coupled QK dynamics} - improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55--22.62 perplexity vs.\ 24.22 for standard attention ($-$6.6--6.9\%), with only 0.11\% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8$\times$ higher seed variance. The integration step count (1--7) is similarly irrelevant - a single coupled step suffices. A compute-matched comparison reveals that coupling is a \emph{sample-efficiency} mechanism: standard attention trained for 2.4$\times$ longer (matching wall-clock) reaches the same perplexity, but requires 2.4$\times$ more tokens. The advantage scales to 150M ($-$6.7\%) but narrows at 350M ($-$1.0\%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 $-$6.6\%, PubMed $-$4.5\%) but degrades on heterogeneous web text ($+$10.3\%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
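
To make the mechanism concrete, the sketch below shows one way to realize coupled QK dynamics in PyTorch: queries and keys are projected as usual, then each stream is advanced by an explicit Euler step driven by the other stream before the attention scores are computed. This is a minimal illustration, not the paper's implementation; the update networks `f` and `g`, the shared learnable `step_size`, and the per-head placement are assumptions made for clarity.

```python
# Minimal sketch of coupled QK dynamics (illustrative, not the paper's code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledQKAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_steps: int = 1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        # Hypothetical shared dynamics: each stream's update is driven by the other.
        self.f = nn.Sequential(nn.Linear(self.d_head, self.d_head), nn.Tanh(),
                               nn.Linear(self.d_head, self.d_head))
        self.g = nn.Sequential(nn.Linear(self.d_head, self.d_head), nn.Tanh(),
                               nn.Linear(self.d_head, self.d_head))
        self.step_size = nn.Parameter(torch.tensor(0.1))  # hypothetical learnable step
        self.n_steps = n_steps

    def forward(self, x, attn_mask=None):
        B, T, _ = x.shape
        def split(t):  # (B, T, D) -> (B, H, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Coupled explicit-Euler updates: Q moves along f(K), K moves along g(Q).
        for _ in range(self.n_steps):
            q_new = q + self.step_size * self.f(k)
            k_new = k + self.step_size * self.g(q)
            q, k = q_new, k_new
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

A symplectic (Hamiltonian-style) variant would instead update `k` from the already-updated `q`; per the ablation summarized above, both forms behave identically once Q and K are coupled, and a single step (`n_steps=1`) suffices.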

Paper Structure

This paper contains 50 sections, 7 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Training and validation perplexity vs. step (60M model, WikiText-103). Three convergence tiers emerge: (1) Coupled QK separates below the pack from ${\sim}$step 10K, (2) Standard and Diff Attn cluster tightly (23.92--24.22), (3) GQA trails. Training curves are cropped to step 5K+ and PPL $\leq$200 to show the informative region.
  • Figure 2: Training and validation perplexity vs. step (150M model, WikiText-103). Coupled QK dynamics separates early and maintains its lead throughout training. Final ranking: Coupled QK (20.12), Diff Attn (21.11), Standard (21.57), GQA (21.87).
  • Figure 3: Per-layer attention entropy at end of training. Left (60M): A sharp entropy dip at Layer 2 reveals the attention sink. GQA collapses furthest (${\approx}$1.65 nats), followed by Standard (${\approx}$1.95). Coupled QK dynamics (${\approx}$2.5) mitigates the collapse. Right (150M): The Layer 2 collapse persists at scale, confirming it is a structural phenomenon. (A minimal sketch of this entropy computation follows the list.)
  • Figure 4: Standard Attention (60M). Layer 0: Clean causal gradients. Layer 4: Attention sink at position 0 dominates all heads. Layer 7: Sink persists with partial content-based recovery.
  • Figure 5: Coupled QK dynamics (60M). Layer 0: H0 shows off-diagonal patterns absent in Standard, indicating the coupled dynamics has diversified QK interactions. Layer 4: The attention sink is present but less extreme - H0 distributes attention more broadly. Layer 7: H0 exhibits the most diverse pattern across all heads, with scattered attention to multiple specific positions.
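
Figure 3 reports per-layer attention entropy in nats. For reference, a minimal sketch of how such a diagnostic can be computed from a layer's attention weights is given below; the averaging over batch, heads, and query positions is an assumed convention, not necessarily the paper's.

```python
import torch

def attention_entropy_nats(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean Shannon entropy (nats) of attention distributions.

    attn: (batch, heads, query_pos, key_pos), each row summing to 1.
    Returns a scalar averaged over batch, heads, and query positions
    (illustrative averaging choice).
    """
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, heads, query_pos)
    return ent.mean()

# Example usage: pass attention probabilities captured from one layer,
# e.g. via a forward hook, and record the scalar per layer.
```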