Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
TL;DR
The paper introduces Generalized Primal Averaging (GPA), a decoupled, stepwise iterate-averaging optimizer that extends Nesterov momentum to address DiLoCo and Schedule-Free limitations in non-distributed training. GPA separates the smoothing of the model evaluation sequence from the information flow into the gradient computation sequence, using two independent parameters and a single extra buffer, enabling stable, faster training on both language and vision tasks. Theoretical analysis provides convergence guarantees when the base optimizer has regret $O(\sqrt{T})$, and empirical results show GPA outperforms DiLoCo and AdamW on Llama-160M/1B and ImageNet ViT across varying batch sizes and inner-step configurations. The work highlights GPA’s potential for more scalable, efficient training and lays groundwork for applying the decoupled interpolation approach to broader distributed settings and optimizers.
Abstract
We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov's method. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
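The decoupling described above can be illustrated with a small sketch. The abstract does not state the update equations, so the code below is an assumption: it starts from the standard primal averaging form of Nesterov's method (a gradient point y interpolated between an averaged sequence x and a base sequence z) and replaces the single interpolation constant with two independent ones. The names `gpa_sgd`, `mu_x`, `mu_y`, and `lr` are hypothetical, and plain gradient descent stands in for the base optimizer (the paper uses optimizers such as AdamW).

```python
import numpy as np


def gpa_sgd(grad, x0, lr=0.1, mu_x=0.9, mu_y=0.5, steps=500):
    """Illustrative decoupled primal averaging around plain gradient descent.

    Assumed update (not taken verbatim from the paper):
        y = mu_y * x + (1 - mu_y) * z      # point where the gradient is computed
        z = z - lr * grad(y)               # base optimizer step
        x = mu_x * x + (1 - mu_x) * z      # smoothed model used for evaluation
    """
    x = np.array(x0, dtype=float)  # averaged (evaluation) sequence
    z = x.copy()                   # base optimizer sequence
    for _ in range(steps):
        # y is recomputed on the fly, so only x and z need to be stored.
        y = mu_y * x + (1.0 - mu_y) * z
        z = z - lr * grad(y)
        # Averaging happens at every step; there is no inner/outer loop.
        x = mu_x * x + (1.0 - mu_x) * z
    return x, z


if __name__ == "__main__":
    target = np.array([1.0, -2.0, 3.0])
    # Gradient of the toy objective 0.5 * ||w - target||^2.
    x_avg, z_base = gpa_sgd(lambda w: w - target, np.zeros(3))
    print("averaged iterate:", x_avg)
    print("base iterate:    ", z_base)
```

In this sketch, setting `mu_x == mu_y` gives back the single-constant primal averaging form, and setting both to zero reduces to the base optimizer alone, since x and y then coincide with z. Keeping the two constants separate is what lets the smoothing of the evaluated model be tuned independently of how much of the average feeds back into the gradient computation.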
