Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

Depen Morwani; Nikhil Vyas; Hanlin Zhang; Sham Kakade

TL;DR

This work links Schedule-Free optimizers, AdEMAMix, MARS, and Lion to accelerated SGD variants, which decouple the momentum coefficient from the weight given to the current gradient. It shows that Schedule-Free SGD corresponds to accelerated SGD followed by weight averaging, and that AdEMAMix and related methods align with accelerated SGD through their momentum structures, with preconditioning playing a key role in some cases. Experiments on a 150-million-parameter decoder-only transformer show that AdEMAMix most closely resembles accelerated SGD and performs best at small batch sizes, while Simplified-AdEMAMix maintains performance across batch-size regimes with lower memory. The paper bridges theory and practice, offering a memory-efficient variant and public code to support adoption and further study of accelerated-SGD-inspired optimizers.

Abstract

Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a language modeling task with a 150M-parameter model. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at https://github.com/DepenM/Simplified-AdEMAMix/.
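
To make the phrase "decoupling the momentum coefficient from the current gradient's weight" concrete, the following is a minimal illustrative sketch, not the algorithm from the paper or its repository. It contrasts a standard exponential-moving-average momentum step, where the current gradient's weight `1 - beta` is fixed by `beta`, with a variant in which the current gradient also enters the update with its own weight. The function names and the parameter `alpha` are hypothetical and chosen only for illustration.

```python
def momentum_step(w, m, grad, lr, beta):
    """Standard EMA momentum: the current gradient's weight (1 - beta) is tied to beta."""
    m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, grad)]
    w = [wi - lr * mi for wi, mi in zip(w, m)]
    return w, m


def decoupled_momentum_step(w, m, grad, lr, beta, alpha):
    """Illustrative decoupled variant (an assumption, not the paper's exact update):
    the current gradient gets an extra weight alpha, chosen independently of beta."""
    m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, grad)]
    w = [wi - lr * (mi + alpha * gi) for wi, mi, gi in zip(w, m, grad)]
    return w, m


if __name__ == "__main__":
    # Toy problem: f(w) = 0.5 * w^2, so the gradient equals w.
    w, m = [1.0], [0.0]
    for _ in range(5):
        w, m = decoupled_momentum_step(w, m, grad=list(w), lr=0.1, beta=0.9, alpha=1.0)
    print(w)
```

With `alpha = 0` the second function reduces to the first, so `alpha` isolates how much the update trusts the current gradient relative to the accumulated momentum.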
