Cautious Optimizers: Improving Training with One Line of Code

Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu

TL;DR

This work addresses the instability and occasionally slow convergence of momentum-based optimizers by introducing Cautious Optimizers, a one-line masking modification that applies an update only where its sign agrees with the gradient's. Framed within a continuous-time Hamiltonian descent model, the authors prove that cautious dynamics preserve the underlying Hamiltonian structure, decrease the objective monotonically, and converge to stationary points. They also provide discrete-time analyses and report consistent improvements across the evaluated tasks, including substantial speedups on LLaMA pretraining ($1.47\times$ for C-AdamW), faster MAE pretraining, better downstream GLUE performance, and higher RLHF rewards. The approach requires minimal implementation effort and no hyperparameter tuning, suggesting a practical, broadly applicable enhancement for large-scale transformer training and beyond.

Abstract

AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we name a cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under the Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing not only speed-ups of up to $1.47$ times on Llama and MAE pretraining, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
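To make the "single-line modification" concrete, here is a minimal sketch in plain Python of the sign-agreement masking described above. It is not the authors' PyTorch code: the function name `cautious_step`, the list-based representation, and the rescaling by the fraction of kept coordinates are all assumptions made for illustration.

```python
def cautious_step(params, grads, updates, lr, eps=1e-8):
    """Return parameters after one cautious step (illustrative, not the authors' code).

    `updates` is the direction proposed by the base optimizer, with the
    convention that the plain step would be p - lr * u.
    """
    # Keep coordinate i only if the proposed update and the gradient share a sign.
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(updates, grads)]
    # Assumption: rescale by the fraction of kept coordinates so the overall
    # step size stays comparable; eps guards against an all-zero mask.
    scale = 1.0 / (sum(mask) / len(mask) + eps)
    return [p - lr * u * m * scale for p, u, m in zip(params, updates, mask)]


# Coordinate 0 agrees in sign and is updated; coordinate 1 disagrees and is frozen.
new_params = cautious_step(
    params=[1.0, 1.0], grads=[0.5, -0.2], updates=[0.4, 0.3], lr=0.1
)
```

In a tensor library the mask and rescaling would typically collapse into a single elementwise expression, which is presumably what the "one line" refers to.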

Paper Structure

This paper contains 37 sections, 9 theorems, 45 equations, 9 figures, 4 tables, 3 algorithms.

Key Result

Theorem 2.1

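Following the dynamics in (equ:m_hd) in $\mathbb{R}^d$, we have

$$\frac{\mathrm{d}}{\mathrm{d}t} \mathcal{H}(\boldsymbol{w}_t, \boldsymbol{s}_t) = \boldsymbol{x}_t^\top (1 - \phi(\boldsymbol{x}_t)) - \Delta_{\mathcal{H}_t}(\Bar{\boldsymbol{w}}_t, \Bar{\boldsymbol{s}}_t),$$

$$\frac{\mathrm{d}}{\mathrm{d}t} \mathcal{L}(\boldsymbol{w}_t) = -\boldsymbol{x}_t^\top \phi(\boldsymbol{x}_t) - \|\nabla \mathcal{L}(\boldsymbol{w}_t)\|_{\Phi_t}^2 = \boldsymbol{x}_t^\top (1 - \phi(\boldsymbol{x}_t)) - \Delta_{\mathcal{L}_t}(\Bar{\boldsymbol{w}}_t, \Bar{\boldsymbol{s}}_t).$$

Here, $\Delta_{\mathcal{H}_t}(\Bar{\boldsymbol{w}}_t, \Bar{\boldsymbol{s}}_t)$ and $\Delta_{\mathcal{L}_t}$ ...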
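As a quick consistency check on the second identity in Theorem 2.1, one can equate its two expressions for $\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{L}(\boldsymbol{w}_t)$. This is only a sketch: it assumes that the $1$ in the theorem denotes the all-ones vector $\mathbf{1}$, and that the transposes and the weighted norm are as written here.

$$
-\boldsymbol{x}_t^\top \phi(\boldsymbol{x}_t) - \|\nabla \mathcal{L}(\boldsymbol{w}_t)\|_{\Phi_t}^2
= \boldsymbol{x}_t^\top (\mathbf{1} - \phi(\boldsymbol{x}_t)) - \Delta_{\mathcal{L}_t}
\quad\Longrightarrow\quad
\Delta_{\mathcal{L}_t} = \boldsymbol{x}_t^\top \mathbf{1} + \|\nabla \mathcal{L}(\boldsymbol{w}_t)\|_{\Phi_t}^2 .
$$

Under this reading, setting $\phi \equiv \mathbf{1}$ makes the correction term $\boldsymbol{x}_t^\top(\mathbf{1} - \phi(\boldsymbol{x}_t))$ vanish, so the objective changes at rate exactly $-\Delta_{\mathcal{L}_t}$.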

Figures (9)

  • Figure 1: Training loss curves on LLaMA 1B, using AdamW / Lion and their cautious variants (using Algorithm \ref{alg:generic-c-optim}). The cautious variants converge better and are 1.47x and 1.28x more sample-efficient than AdamW and Lion, respectively.
  • Figure 2: Left: We compare gradient descent with Polyak momentum (GDM) against its cautious variant (C-GDM). We also provide the gradient descent (GD) result as a baseline, using a 10x larger step size for GD than for GDM and C-GDM. Details are provided in Section \ref{sec::exp-toy}. The first plot shows the optimization trajectories of the two optimizers, both starting from $(1, 1)$ with zero-initialized momentum. C-GDM lands at the optimum without overshooting. The second and third plots confirm that C-GDM monotonically decreases both the objective and the Hamiltonian of the original GDM. Right: We plot $\mathcal{L}(\boldsymbol{w}_t)$ versus $t$ for C-GDM and GDM with different combinations of $(\epsilon, \beta)$. Across all combinations, C-GDM outperforms GDM.
  • Figure 3: We compare gradient descent with Polyak momentum (GDM) and its element-wise cautious variant (C-GDM), using gradient descent (GD) as a baseline. The step size for GD and the hyperparameters of GDM (step size and momentum coefficient) are chosen to achieve the optimal convergence rates, which can be derived analytically (see, e.g., goh2017why). For cautious optimizers, step sizes $\epsilon$ and momentum coefficients $\beta$ are tuned empirically, as shown in Figure \ref{fig:toy1}. Detailed experimental settings are described in Section \ref{sec::exp-toy}. Plot (a) visualizes the optimization trajectories of the three methods, starting from the initial point $(1, 1)$ with zero-initialized momentum. Notably, C-GDM converges to the optimum with significantly reduced overshooting and oscillation. Plot (b) zooms in on the trajectories from Plot (a), focusing on a smaller region $(0.02 \times 0.02)$ for clarity. Plots (c) and (d) show that C-GDM monotonically decreases both the objective and the Hamiltonian associated with the original GDM, and reaches lower values of both than GDM.
  • Figure 4: The sparsity ratio $r(\mathbf{x}) = \frac{\mathtt{nnz}(\mathbf{x} > 0)}{\mathtt{dim}(\mathbf{x})}$ during pretraining of LLaMA 100M on the C4 dataset using the C-AdamW optimizer. The ratio quantifies the proportion of nonzero elements in the representations over training steps.
  • Figure 5: Training loss curves for AdamW, C-AdamW, Lion, C-Lion on LLaMA with 60M, 100M, 350M, and 1B parameters.
  • ...and 4 more figures
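
The ratio in the Figure 4 caption, $r(\mathbf{x}) = \mathtt{nnz}(\mathbf{x} > 0) / \mathtt{dim}(\mathbf{x})$, can be sketched in a few lines of plain Python. The function name `positive_ratio` is hypothetical, and what $\mathbf{x}$ stands for is not specified in this excerpt; the example feeds in the elementwise product of an update and a gradient purely as an illustration.

```python
def positive_ratio(x):
    """r(x) = nnz(x > 0) / dim(x): fraction of strictly positive entries of x."""
    return sum(1 for v in x if v > 0) / len(x)


# Illustrative input (an assumption): elementwise product of update and gradient.
updates = [0.4, 0.3, -0.1, 0.2]
grads = [0.5, -0.2, -0.3, 0.0]
products = [u * g for u, g in zip(updates, grads)]

# Entries 0 and 2 are strictly positive; a zero product is not counted.
ratio = positive_ratio(products)
```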

Theorems & Definitions (18)

  • Theorem 2.1
  • Corollary 2.2
  • Theorem 2.3
  • Theorem 2.4
  • Example 1.1
  • Example 1.2
  • proof
  • proof
  • proof
  • Theorem 1.3
  • ...and 8 more