Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade

TL;DR

This work links Schedule-Free optimizers, AdEMAMix, MARS, and Lion to accelerated SGD by showing how momentum can be decoupled from the current gradient. It demonstrates that Schedule-Free SGD corresponds to accelerated SGD followed by weight averaging, and that AdEMAMix and related methods align with accelerated SGD through their momentum structures, with preconditioning playing a key role in some cases. Empirical results on a 150-million-parameter decoder-only transformer show AdEMAMix most closely resembles accelerated SGD and performs best at small batch sizes, while Simplified-AdEMAMix maintains performance across batch regimes with lower memory. The paper provides a practical bridge between theory and practice, offering a memory-efficient variant and public code to facilitate adoption and further study of accelerated SGD-inspired optimizers.

Abstract

Recent advances in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a language modeling task with a 150M-parameter model. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at https://github.com/DepenM/Simplified-AdEMAMix/.
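
As a rough illustration of what "decoupling the momentum coefficient from the current gradient's weight" can mean, the sketch below gives the current gradient its own weight in the step, separate from the momentum coefficient. This is a minimal sketch under stated assumptions, not the paper's algorithm or the code in its repository: the function names, the parameter `grad_weight`, and the toy quadratic objective are all hypothetical and chosen only for illustration.

```python
def decoupled_momentum_step(w, m, grad, lr=0.1, beta=0.9, grad_weight=1.0):
    """One momentum step in which the current gradient has its own weight.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    # Exponential moving average of gradients (the momentum buffer).
    m_new = beta * m + (1.0 - beta) * grad
    # The current gradient enters the step with a weight chosen
    # independently of beta (assumption: a simple additive form).
    w_new = w - lr * (m_new + grad_weight * grad)
    return w_new, m_new


def run(steps=200, grad_weight=1.0):
    # Toy problem: minimize f(w) = 0.5 * w**2, whose gradient is w.
    w, m = 1.0, 0.0
    for _ in range(steps):
        w, m = decoupled_momentum_step(w, m, grad=w, grad_weight=grad_weight)
    return w
```

Setting `grad_weight=0.0` recovers a plain exponential-moving-average momentum step, in which the current gradient's influence is fixed at `1 - beta`; a nonzero `grad_weight` lets that influence be tuned separately from `beta`.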

Paper Structure

This paper contains 26 sections, 25 equations, 3 figures, 3 algorithms.

Figures (3)

  • Figure 1: Comparison of the best runs of various optimizers, as described in Section \ref{sec:exp}, for a language modeling task on a decoder-only 150M-parameter transformer model. We find that AdEMAMix and Simplified-AdEMAMix perform the best, owing to their precise similarity to accelerated SGD variants.
  • Figure 2: Comparison of the best runs of AdamW with cosine decay, Schedule-Free AdamW, and LAProp at higher batch size. Experimental details can be found in Section \ref{sec:largebsz}.
  • Figure 3: Comparison of the best runs of AdEMAMix (with and without $\beta_1 = 0.0$) and our Simplified-AdEMAMix variant for higher batch size experiments. Experimental details can be found in Section \ref{sec:largebsz}.