Logarithmic-time Schedules for Scaling Language Models with Momentum
Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette
TL;DR
The paper investigates scale-aware optimization for large language models by introducing ADANA, an AdamW-like optimizer that uses logarithmic-time schedules for the first- and second-moment decay coefficients ($β_1(t)$, $β_2(t)$) and for the decoupled weight decay $λ(t)$, complemented by a damping schedule $α(t)$. The approach exploits the power-law structure of language data so that the optimizer's memory horizon grows with training time, yielding compute-efficiency gains of up to ~40% that persist as model size increases. The authors also analyze the stability of log-time momentum, introduce variants (Dana-MK4, Dana-Star, Dana-Star-MK4) to handle sparse gradients and inhomogeneous spectral dimensions, and demonstrate robust gains across transformer ladders from 45M to 2.6B parameters on FineWeb and Qwen3 architectures. They further show that log-time weight decay alone improves performance and that careful scheduling of $β_2$ is essential for stability. Overall, the work provides a principled framework for transferring optimization hyperparameters across model sizes, with practical improvements in compute efficiency and robustness for large-scale transformer training.
Abstract
In practice, the hyperparameters $(β_1, β_2)$ and weight decay $λ$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(β_1, β_2, λ)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves up to 40% compute-efficiency gains relative to a tuned AdamW, with gains that persist, and even improve, as model scale increases. We further show that similar benefits arise when applying logarithmic-time scheduling to AdEMAMix, and that logarithmic-time weight decay alone can yield significant improvements. Finally, we present variants of ADANA that mitigate potential failure modes and improve robustness.
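To make the idea of a memory horizon that grows with training time concrete, the following is a minimal sketch of one possible time-varying momentum coefficient. The functional form $β(t) = 1 - δ/(δ + t)$, the constant `delta`, and all function names are assumptions chosen for illustration; this section does not state the exact schedules ADANA uses, nor its damping $α(t)$ or weight-decay $λ(t)$ schedules, none of which are reproduced here.

```python
from typing import Iterable, List


def log_time_beta(step: int, delta: float = 8.0) -> float:
    """Hypothetical schedule beta(t) = 1 - delta / (delta + t).

    The form and the default `delta` are illustrative assumptions,
    not values taken from the paper.
    """
    if step < 0 or delta <= 0:
        raise ValueError("step must be >= 0 and delta must be > 0")
    return 1.0 - delta / (delta + step)


def memory_horizon(beta: float) -> float:
    """Effective averaging window of an exponential moving average, 1 / (1 - beta)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 / (1.0 - beta)


def ema_with_schedule(values: Iterable[float], delta: float = 8.0) -> List[float]:
    """Run a scalar EMA whose coefficient follows log_time_beta."""
    m = 0.0
    out: List[float] = []
    for t, g in enumerate(values):
        beta = log_time_beta(t, delta)
        # Standard EMA update, but with a step-dependent coefficient.
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out


if __name__ == "__main__":
    for t in (0, 8, 792):
        b = log_time_beta(t)
        print(t, round(b, 4), round(memory_horizon(b), 2))
```

Under this assumed form the horizon equals $1 + t/δ$, so the window always covers a fixed fraction of the steps taken so far. A constant $β$, as in standard AdamW, instead fixes the horizon at $1/(1-β)$ steps regardless of how long training has run.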
