Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training

Abdelrahman Abouzeid

Abstract

In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3×2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 nats under Muon, roughly a threefold increase. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon's faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf's alpha from its published default of 0.5 to 0.3 recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale; note that this reduced setting departs from the default of Chen & Liu (2025). Using Derf's published default alpha with Muon incurs a 0.66-nat interaction penalty without producing NaNs or divergence, making the failure easy to miss in short pilot runs.
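For concreteness, here is a minimal PyTorch sketch of the two bounded normalizers compared above. The functional forms are assumptions based on the descriptions in this abstract: DyT applies a learnable tanh squashing (following Zhu et al., 2025), and Derf is taken as the analogous erf form with the alpha values discussed above; the authors' actual implementations may differ in details such as parameter shapes and initialization.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: y = gamma * tanh(alpha * x) + beta (per Zhu et al., 2025).

    alpha is a learnable scalar; gamma/beta form a per-channel affine.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


class Derf(nn.Module):
    """Dynamic Erf, sketched here as the erf analogue of DyT.

    alpha_init=0.5 is the published default cited above; 0.3 is the reduced
    value that keeps erf nearer its linear regime under Muon's faster
    weight growth.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # erf saturates at +/-1: once |alpha * x| grows large, the relative
        # magnitudes of big activations are discarded (scale blindness).
        return self.gamma * torch.erf(self.alpha * x) + self.beta
```

Both layers are reduction-free: unlike RMSNorm, neither computes a statistic over the hidden dimension, which is what makes them attractive under tensor parallelism (Figure 4).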


Figures (8)

  • Figure 1: Overview of the optimizer-driven saturation mechanism. At the tested learning rates, Muon grows weights ${\sim}2.2\times$ faster than AdamW, pushing Derf into steep saturation (83% of elements at $\pm 1$). AdamW's slower growth keeps Derf closer to its useful regime (10% saturation). Muon+EMA (bottom) preserves Muon's fast weight growth but uses a running $\hat{\sigma}$ to keep the post-blend erf argument near unit scale (std 1.3, 2% post-blend saturation), recovering ${\sim}84\%$ of the Derf penalty. Weight and saturation numbers from layer 15, step 950 (seed 42); quality gaps at step 1000. A sketch of the EMA mechanism follows this figure list.
  • Figure 2: Rescaling (RMSNorm) preserves relative magnitudes at any input scale; squashing (Derf) destroys them outside a narrow operating regime.
  • Figure 3: Validation loss over 1000 steps. EMA-blend ($\lambda\!=\!0.9$) converges to within 0.15 nats of RMSNorm+Muon while outperforming RMSNorm+AdamW by 0.41 nats. Derf+Muon fails to outperform Derf+AdamW. Multi-seed variance is negligible. Three-seed results appear in Appendix \ref{app:multiseed}.
  • Figure 4: Tensor parallelism benchmark on H100 NVLink (1/2/4/8 GPUs; training uses a single H200, Section \ref{sec:setup}). Norm-layer wall-clock time at hidden$=$2048. DyT and Derf are fully reduction-free; EMA requires 33 allreduces/step vs. RMSNorm's 4,224. At 8-way TP: DyT $9.4\times$, Derf $9.3\times$, EMA $7.8\times$ faster than RMSNorm. At 1 GPU (no communication), the gap is pure compute: ${\sim}3.2\times$ for all three, confirming the speedup scales with TP degree.
  • Figure 5: Per-layer diagnostics at step 340 of the main 1000-step seed-42 runs. Saturation, weight growth, and erf input magnitude increase with depth. For EMA-blend, the saturation panel uses post-blend saturation of the actual erf argument. Muon drives the vanilla Derf metrics higher than AdamW, while EMA-blend decouples weight growth from erf input.
  • ...and 3 more figures
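The Figure 1 and Figure 3 captions describe the EMA-blend variant only at a high level: a running scale estimate $\hat{\sigma}$ that keeps the post-blend erf argument near unit scale, with $\lambda = 0.9$. The sketch below is one plausible reading under those constraints, not the paper's exact formulation; the per-tensor (rather than per-channel) statistic, the epsilon, and the placement of the update are illustrative assumptions.

```python
import torch
import torch.nn as nn


class EMADerf(nn.Module):
    """Hedged sketch of an EMA-blend Derf layer (names and details assumed).

    A running activation-scale estimate sigma_hat, updated with decay lam,
    keeps the erf argument near unit scale even as the optimizer inflates
    upstream weight norms.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5, lam: float = 0.9):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.lam = lam
        # Running scale estimate; a buffer, not a trained parameter.
        self.register_buffer("sigma_hat", torch.ones(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                batch_scale = x.detach().float().std()
                # EMA update: sigma_hat <- lam * sigma_hat + (1 - lam) * batch_scale
                self.sigma_hat.mul_(self.lam).add_((1.0 - self.lam) * batch_scale)
        # Rescaling by sigma_hat keeps erf in its near-linear regime, so the
        # layer bounds outliers without discarding overall activation scale.
        z = self.alpha * x / (self.sigma_hat + 1e-6)
        return self.gamma * torch.erf(z) + self.beta
```

Any cross-rank synchronization of $\hat{\sigma}$ under tensor parallelism (presumably the source of the 33 allreduces/step reported in Figure 4) is omitted here.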