Cautious Optimizers: Improving Training with One Line of Code
Kaizhao Liang, Lizhang Chen, Bo Liu, Qiang Liu
TL;DR
This work addresses the instability and sometimes slow convergence of momentum-based optimizers by introducing Cautious Optimizers, a one-line modification that masks out update components whose sign disagrees with the current gradient. Framing the analysis in a continuous-time Hamiltonian descent model, the authors prove that the cautious dynamics preserve the underlying Hamiltonian structure, improve the objective monotonically, and converge to stationary points. They also provide discrete-time analyses and report consistent improvements across tasks: substantial speedups on LLaMA pretraining ($1.47\times$ for C-AdamW), faster MAE pretraining, better GLUE downstream performance, and higher RLHF rewards. The approach requires minimal implementation effort and no additional hyperparameter tuning, suggesting a practical, broadly applicable enhancement for large-scale transformer training and beyond.
Abstract
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we call a cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under the Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing not only speed-ups of up to $1.47$ times on LLaMA and MAE pretraining, but also better results on LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
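To make the masking idea concrete, here is a minimal sketch in plain Python rather than PyTorch. It assumes the mask keeps an update component only when it has the same sign as the current gradient, and it uses a simple exponential-moving-average momentum as a stand-in base optimizer. The function names `cautious_mask` and `cautious_momentum_step`, the default `lr` and `beta` values, and the omission of any rescaling of the masked update are illustrative assumptions, not the authors' implementation.

```python
def cautious_mask(update, grad):
    """Zero out update components whose sign disagrees with the gradient.

    Hypothetical helper: a component is kept only if update * grad > 0.
    """
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]


def cautious_momentum_step(params, grads, momentum, lr=0.1, beta=0.9):
    """One step of EMA momentum with a cautious mask (illustrative only)."""
    # Base optimizer (assumed here): exponential moving average of gradients.
    new_momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    # The cautious modification: drop components that point against the gradient.
    masked = cautious_mask(new_momentum, grads)
    # Descent step using only the sign-consistent components.
    new_params = [p - lr * u for p, u in zip(params, masked)]
    return new_params, new_momentum


if __name__ == "__main__":
    params = [1.0, 1.0]
    grads = [1.0, -1.0]      # second coordinate's gradient has flipped sign
    momentum = [2.0, 2.0]    # stale momentum still points the old way
    new_params, new_momentum = cautious_momentum_step(params, grads, momentum)
    print(new_params, new_momentum)
```

In this toy case the second coordinate is left unchanged for the step, because its momentum and gradient disagree in sign; the momentum buffer itself is still updated as usual.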
