Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Yongzhong Xu

TL;DR

Parameter updates organize into a dominant drift direction with transverse residual dynamics, suggesting that optimizer choice shapes the effective dimensionality and structure of learning trajectories beyond what loss values alone reveal.

Abstract

We analyze cumulative parameter trajectories of transformer training under AdamW and identify a dominant low-dimensional drift direction ("backbone") that captures 60--80% of long-horizon displacement from initialization. This direction is highly stable over rolling training windows yet reorients gradually across phases, particularly following objective reweighting. Per-batch gradients exhibit near-noise-floor alignment with the backbone, whereas optimizer-integrated updates align strongly with it, indicating that the structure emerges from accumulated optimizer dynamics rather than instantaneous gradient geometry. Replacing AdamW with SGD-family optimizers eliminates this structure, and reducing $\beta_2$ smoothly degrades backbone dominance and reheating recoverability. Reheating experiments show that transverse probe modes can be transiently re-excited without substantially altering accumulated backbone drift. These results provide a trajectory-level characterization of optimizer-induced geometric structure in transformer training and shift attention from instantaneous gradient properties to cumulative update dynamics.

Paper Structure (54 sections, 19 equations, 4 figures, 6 tables)

This paper contains 54 sections, 19 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Backbone--residual decomposition (seed 42, Block 0). Left: the backbone coordinate $a(t)$ grows monotonically while the residual norm $\lVert \mathbf{r}(t) \rVert$ oscillates and then decays. Right: the out-of-distribution probe accuracy $p_{\mathrm{ood}}$ (grey) fluctuates in phase with the residual, while the backbone is unaffected. The vertical dashed line marks the $\lambda$-transition at step 4000.
  • Figure 2: Gradient vs. optimizer-update alignment with the backbone (Block 0, seed 42). The 200-step optimizer update $\mathbf{u}_t$ (blue) aligns strongly with the backbone ($|\cos| \approx 0.15$--$0.34$), peaking before the $\lambda$-transition (dashed line) and declining afterward. Per-batch gradients $\mathbf{g}_t$ (grey squares) remain at the random noise floor (${\sim}4 \times 10^{-4}$) throughout. The backbone emerges from optimizer integration, not instantaneous gradient structure.
  • Figure 3: Fisher curvature along key directions (seed 42). Top: Rayleigh quotient $q(\mathbf{v})$ for the backbone (blue), switch direction (red), second PC (green), and mean over random orthogonal directions (grey). The backbone curvature increases by three orders of magnitude. Bottom: anisotropy ratio $\alpha = q(\mathbf{v}_{\mathrm{b}}) / \mathbb{E}[q(\mathbf{w}_\perp)]$ spikes at the $\lambda$-transition (step 4000) before partially relaxing.
  • Figure 4: Reheating trajectories (seed 42). Three learning rates are tested from the step-10,000 endpoint (grey background: original training). All three achieve probe re-entry, but the effect is transient: $p_{\mathrm{ood}}$ peaks and then decays as the cosine schedule reduces $\eta_t$. The optimal LR ($6\times 10^{-4}$, orange) exceeds the original training peak.
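
The alignment and curvature quantities named in the captions of Figures 2 and 3 can be illustrated with a small NumPy sketch. This is not the paper's code: it assumes that $q(\mathbf{v})$ is the Rayleigh quotient of an empirical Fisher matrix built from per-sample gradients, and the helper names (`abs_cos`, `rayleigh_quotient`, `anisotropy_ratio`), the number of random directions, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def abs_cos(x, v):
    """|cos| of the angle between a gradient or update vector x and a direction v."""
    return abs(x @ v) / (np.linalg.norm(x) * np.linalg.norm(v))

def rayleigh_quotient(per_sample_grads, v):
    """q(v) = v^T F v / v^T v, with F the mean of g g^T over samples.

    Assumption: F is the empirical Fisher; the matrix itself is never formed,
    since v^T F v equals the mean squared projection of the gradients onto v.
    """
    proj = per_sample_grads @ v
    return np.mean(proj ** 2) / (v @ v)

def anisotropy_ratio(per_sample_grads, v_b, n_random=32, seed=0):
    """alpha = q(v_b) / mean of q(w) over random directions w orthogonal to v_b."""
    rng = np.random.default_rng(seed)
    v_b = v_b / np.linalg.norm(v_b)
    qs = []
    for _ in range(n_random):
        w = rng.standard_normal(v_b.shape)
        w -= (w @ v_b) * v_b  # remove the backbone component
        qs.append(rayleigh_quotient(per_sample_grads, w))
    return rayleigh_quotient(per_sample_grads, v_b) / np.mean(qs)

# Synthetic check: gradients with 100x more variance along the first axis.
rng = np.random.default_rng(1)
d = 20
e0 = np.eye(d)[0]
scales = np.ones(d)
scales[0] = 10.0
grads = rng.standard_normal((5000, d)) * scales
alpha = anisotropy_ratio(grads, e0)  # close to 100 for this synthetic data
```

In a real run, `per_sample_grads` would hold flattened per-example gradients for the parameter block of interest, and `abs_cos` would be applied both to a single-batch gradient and to the parameter change accumulated over a window of optimizer steps, which is the comparison Figure 2 reports.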

Theorems & Definitions (3)

  • Definition 1: Drift matrix
  • Remark 1: Why uncentered PCA?
  • Definition 2: Backbone decomposition
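
The definitions listed above are not reproduced in this summary, so the following is only a minimal sketch of one plausible reading. It assumes that the drift matrix stacks the displacements $\theta(t) - \theta(0)$ as rows, that the backbone $\mathbf{v}_{\mathrm{b}}$ is the leading right singular vector of that matrix without mean-centering (uncentered PCA), and that $a(t)$ and $\mathbf{r}(t)$ are the projection onto $\mathbf{v}_{\mathrm{b}}$ and the remainder. The function name `backbone_decomposition` and the synthetic trajectory are hypothetical.

```python
import numpy as np

def backbone_decomposition(checkpoints):
    """Split a parameter trajectory into backbone and residual parts.

    checkpoints: array of shape (T + 1, d); row 0 is the initialization.
    Returns (v_b, a, r_norm, explained) for the T post-initialization rows.
    """
    theta = np.asarray(checkpoints, dtype=float)
    drift = theta[1:] - theta[0]  # assumed drift matrix: displacement from init
    # Uncentered PCA: SVD of the drift matrix with no column-mean subtraction,
    # so the leading direction captures net displacement, not just fluctuation.
    _, s, vt = np.linalg.svd(drift, full_matrices=False)
    v_b = vt[0]
    if drift[-1] @ v_b < 0:  # fix the sign so the final drift is positive
        v_b = -v_b
    a = drift @ v_b                    # backbone coordinate a(t)
    r = drift - np.outer(a, v_b)       # residual r(t), orthogonal to v_b
    explained = s[0] ** 2 / np.sum(s ** 2)
    return v_b, a, np.linalg.norm(r, axis=1), explained

# Synthetic trajectory: steady drift along one axis plus small isotropic noise.
rng = np.random.default_rng(0)
d, T = 50, 40
direction = np.zeros(d)
direction[0] = 1.0
steps = np.arange(T + 1)[:, None]
traj = steps * direction + 0.01 * rng.standard_normal((T + 1, d))
v_b, a, r_norm, frac = backbone_decomposition(traj)
```

Skipping the centering step matters here: subtracting the column mean would remove the mean displacement from initialization, which is part of the long-horizon drift the backbone is meant to capture.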