Logarithmic-time Schedules for Scaling Language Models with Momentum

Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette

TL;DR

The paper investigates scale-aware optimization for large language models by introducing ADANA, an AdamW-like optimizer that uses logarithmic-time schedules for the 1st and 2nd moments ($β_1(t)$, $β_2(t)$) and decoupled weight decay $λ(t)$, complemented by a damping schedule $α(t)$. This approach leverages the power-law structure of language data to grow the optimizer's memory horizon with training, achieving substantial compute-efficiency gains (up to ~40%) that persist as model size increases. The authors also analyze the stability of log-time momentum, introduce variants (Dana-MK4, Dana-Star, Dana-Star-MK4) to handle sparse gradients and inhomogeneous spectral dimensions, and demonstrate robust gains across transformer ladders from 45M to 2.6B parameters on FineWeb and Qwen3 architectures. They further show that log-time weight decay alone improves performance and that careful scheduling of $β_2$ is essential for stability. Overall, the work provides a principled, scalable framework for transferring optimization hyperparameters across model sizes, with practical improvements in compute efficiency and robustness for large-scale transformer training.

Abstract

In practice, the hyperparameters $(β_1, β_2)$ and weight-decay $λ$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(β_1, β_2, λ)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves up to 40% compute efficiency relative to a tuned AdamW, with gains that persist--and even improve--as model scale increases. We further show that similar benefits arise when applying logarithmic-time scheduling to AdEMAMix, and that logarithmic-time weight-decay alone can yield significant improvements. Finally, we present variants of ADANA that mitigate potential failure modes and improve robustness.
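The idea of a memory horizon that grows with training time can be illustrated with a small sketch. It is not the authors' implementation. The schedule form $\beta(t) = 1 - \delta/(\delta + t)$ is an assumption, modeled on the $\beta_1 = 1 - \delta/T$ target in Lemma J.1 with $T$ replaced by the running step count. The damping form $\alpha(t) = (1+t)^{1-\kappa}$ is the one quoted in the figure captions. The function names and the default $\delta = 8$ are hypothetical.

```python
def log_time_beta(t: int, delta: float = 8.0) -> float:
    """Assumed log-time momentum coefficient: beta(t) = 1 - delta / (delta + t)."""
    return 1.0 - delta / (delta + t)


def memory_horizon(beta: float) -> float:
    """Effective averaging window of an exponential moving average: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


def damping(t: int, kappa: float = 0.85) -> float:
    """Damping schedule alpha(t) = (1 + t)^(1 - kappa), as quoted in the figure captions."""
    return (1.0 + t) ** (1.0 - kappa)


if __name__ == "__main__":
    # With the assumed schedule the horizon equals (delta + t) / delta,
    # so it grows linearly with the step count instead of staying fixed.
    for t in (10, 100, 1_000, 10_000):
        beta = log_time_beta(t)
        print(f"t={t:>6}  beta={beta:.6f}  horizon={memory_horizon(beta):.1f}  "
              f"alpha={damping(t):.3f}")
```

Under this assumed form, a fixed $\beta_1 = 0.9$ corresponds to a constant horizon of 10 steps, whereas the log-time coefficient keeps lengthening the window as training proceeds. Setting $\kappa = 1$ makes $\alpha(t)$ constant.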

Paper Structure (211 sections, 9 theorems, 172 equations, 37 figures, 23 tables, 12 algorithms)


Key Result

Lemma J.1

Let $\delta > 0$, $0 < \beta_3 < 1$ (short momentum), and define for any $T>0$, $\beta_1 = 1 - \delta/T$ (target long momentum). Suppose that $0 < \beta_1 < 1$ and define the Ademamix schedule Then as $t, T \to \infty$ with $0 < t \leq T$, we have

Figures (37)

  • Figure 1: Scaling laws and compute multiplier$^1$ vs. compute $C$ (in Petaflop-hours; 1 PFH $= 3.6 \times 10^{18}$ FLOPs $\approx$ 1 H100-hour) for ADana variants with transformers (architecture in Tab. \ref{table:enoki_large_41_compact}) on FineWeb. Left axis: compute efficiency relative to AdamW. The benefits of ADana and Dana-Star-MK4 ($\kappa=0.85$) increase with scale, reaching a ${\sim}40\%$ compute-efficiency gain. Ademamix (DW)$^0$ and Muon (constant WD) consistently outperform AdamW across scales, but their compute efficiency decreases at larger scales. Right axis: validation loss as a function of compute, fit to a broken power law $L = a + b C^{-c} + e C^{-f}$ with shared saturation $a$ across optimizers (Scaling law procedures & prior work; Sec. \ref{subsec:single_vs_broken}).
  • Figure 2: Training loss curves. Validation loss over training across scales from 45.7M to 2.62B parameters for AdamW and ADana ($\kappa=0.85$) on FineWeb \cite{penedo2024fineweb}. Models are trained at compute-optimal scaling $D = 20 N$, where $N$ is the total number of parameters and $D$ is the number of tokens; the final validation loss follows the scaling law of \cite{hoffmann2022chinchilla}. ADana achieves lower loss than AdamW for most of training, especially at larger scales, and consistently outperforms it at the end of training.
  • Figure 3: Instabilities in Logarithmic-time Momentum. Log-NAdamW (DW) is unstable due to the undamped momentum learning rate $\alpha(t) = \delta+t$. Log-AdamW (DW) and ADana-no-gradient use $\alpha(t)=1$ and $\alpha(t) = (1+t)^{1-\kappa}$ with $\kappa=0.75$, respectively, without the stabilizing gradient term, and show degraded performance against the baseline AdamW (DW). Both the stabilizing gradient term and the damped momentum learning rate $\alpha(t)$ are necessary to achieve good performance.
  • Figure 4: Impact & sensitivity of $\kappa$ on scaling performance of ADana (Alg. \ref{alg:adana}) on the Enoki transformer scaling ladder with FineWeb; $\kappa=0.85$ gives the best performance. At the largest scales, compared with AdamW with log-time WD, the optimal $\kappa=0.85$ yields a compute gain of more than $30\%$ ($40\%$ w.r.t. AdamW with constant WD), versus less than $20\%$ for $\kappa=0.75$. Performance improves across scales for the more conservative $\kappa\geq 0.85$, while it degrades for the more aggressive $\kappa\leq 0.8$; $\kappa=1.0$ performs similarly to the baseline, as predicted by the toy model PLRF.
  • Figure 5: Dependence of $\alpha(t)$ on time and scale. Sweeps of constant $\tilde{\alpha}$ in ADana with $\alpha(t) = \tilde{\alpha} \cdot (1+t)$. We vary both the number of iterations (going beyond Chinchilla scale) and the model size. Fitting the optimal $\tilde{\alpha}$ recovers $\alpha(t) = (1+t)^{1-\kappa}$. Note that here $\tilde{\alpha} \approx 1$ for $B = 32$.
  • ...and 32 more figures
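
The quantities in the captions of Figures 1 and 2 can be made concrete with a short sketch. The token budget $D = 20N$, the conversion 1 PFH $= 3.6 \times 10^{18}$ FLOPs, and the broken power law $L = a + b C^{-c} + e C^{-f}$ come from the captions. The estimate $C \approx 6ND$ is a common rule of thumb and an assumption here, not something the captions state. All function names are hypothetical, and no fitted coefficients are implied.

```python
PFH_IN_FLOPS = 3.6e18  # 1 Petaflop-hour = 1e15 FLOP/s * 3600 s (Figure 1 caption)


def chinchilla_tokens(n_params: float) -> float:
    """Token budget D = 20 N used for compute-optimal training (Figure 2 caption)."""
    return 20.0 * n_params


def approx_compute_pfh(n_params: float) -> float:
    """Training compute in Petaflop-hours.

    ASSUMPTION: C is approximated as 6 * N * D FLOPs; the source does not
    say how compute is counted.
    """
    flops = 6.0 * n_params * chinchilla_tokens(n_params)
    return flops / PFH_IN_FLOPS


def broken_power_law(C: float, a: float, b: float, c: float, e: float, f: float) -> float:
    """Loss model L = a + b * C^(-c) + e * C^(-f) with saturation level a."""
    return a + b * C ** (-c) + e * C ** (-f)


if __name__ == "__main__":
    for n in (45.7e6, 2.62e9):  # smallest and largest scales in Figure 2
        print(f"N={n:.3g}  D={chinchilla_tokens(n):.3g}  C~{approx_compute_pfh(n):.3g} PFH")
```

With positive $b, c, e, f$ the modeled loss decreases monotonically in $C$ and approaches the shared saturation $a$, which is why a single $a$ can be shared across optimizers in the fit.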

Theorems & Definitions (13)

  • Lemma J.1: Ademamix schedule approximates Dana schedule
  • proof
  • Theorem L.1
  • Corollary L.2: Divergence of Log-AdamW for Lipschitz functions
  • Theorem L.3: Adam with constant $\beta_1$, $\beta_2$ satisfies necessary condition
  • Theorem L.4
  • Theorem P.1: Boundedness of Standard Adam, Theorem \ref{thm:constant_beta_necessary}
  • proof: Proof of Theorem \ref{thm:normal_tightness}
  • Theorem P.2: Boundedness of log-time schedules for $\beta_1$ and $\beta_2$ if multiplied by $\sqrt{p}$, Theorem \ref{thm:sparsity_main}
  • proof: Proof of Theorem \ref{thm:long_tightness}
  • ...and 3 more