Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao

TL;DR

The paper introduces Generalized Primal Averaging (GPA), a decoupled, stepwise iterate-averaging optimizer that extends Nesterov momentum to address DiLoCo and Schedule-Free limitations in non-distributed training. GPA separates the smoothing of the model evaluation sequence from the information flow into the gradient computation sequence, using two independent parameters and a single extra buffer, enabling stable, faster training on both language and vision tasks. Theoretical analysis provides convergence guarantees when the base optimizer has regret $O(\sqrt{T})$, and empirical results show GPA outperforms DiLoCo and AdamW on Llama-160M/1B and ImageNet ViT across varying batch sizes and inner-step configurations. The work highlights GPA’s potential for more scalable, efficient training and lays groundwork for applying the decoupled interpolation approach to broader distributed settings and optimizers.

Abstract

We propose Generalized Primal Averaging (GPA), an extension of Nesterov's method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo's periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW's) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW's validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by $O(\sqrt{T})$, where $T$ is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
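The abstract does not spell out the update rule, so the following is only a minimal sketch of what "decoupling the interpolation constant" could look like. It assumes the standard three-sequence primal averaging form of Nesterov's method, uses plain SGD as the base optimizer instead of AdamW, and runs on a toy quadratic. The function names, the parameter names `mu_x` and `mu_y`, and all numeric values are illustrative assumptions, not taken from the paper.

```python
def gpa_sgd(grad, w0, lr=0.1, mu_x=0.9, mu_y=0.5, steps=300):
    """Hypothetical GPA-style loop with plain SGD as the base optimizer.

    Assumed form (not quoted from the paper):
        y_t     = mu_y * x_t + (1 - mu_y) * z_t      # gradient point
        z_{t+1} = z_t - lr * grad(y_t)               # base optimizer step
        x_{t+1} = mu_x * x_t + (1 - mu_x) * z_{t+1}  # smoothed average
    Setting mu_x == mu_y gives the coupled primal averaging form.
    """
    x = list(w0)  # evaluation sequence (the smoothed weights)
    z = list(w0)  # base-optimizer sequence
    for _ in range(steps):
        # y is recomputed from x and z each step rather than stored.
        y = [mu_y * xi + (1.0 - mu_y) * zi for xi, zi in zip(x, z)]
        g = grad(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        x = [mu_x * xi + (1.0 - mu_x) * zi for xi, zi in zip(x, z)]
    return x


def quad_grad(w, curv=(1.0, 10.0)):
    """Gradient of the toy objective 0.5 * sum(c_i * w_i**2)."""
    return [c * wi for c, wi in zip(curv, w)]


x_final = gpa_sgd(quad_grad, [1.0, 1.0])
print(x_final)  # both coordinates should be close to the minimizer at 0
```

In this sketch `mu_x` controls only how strongly the evaluation sequence is smoothed, while `mu_y` controls only how much of that smoothed sequence feeds into the gradient computation. With `mu_x = mu_y = 0` the loop reduces to plain SGD on `z`.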

Paper Structure

This paper contains 28 sections, 8 theorems, 51 equations, 10 figures, 4 tables, 3 algorithms.

Key Result

Proposition 1

Given fixed learning rates $\gamma_{\mathrm{primal}}, \gamma_{\mathrm{modern}} > 0$, the primal averaging formulation of Nesterov's method (equation eq:primal_averaging_form) is equivalent to its modern formulation (equation eq:modern_form) in the sense that when $\mu_{\mathrm{primal}} = \mu_{\mathrm{modern}} = \mu$ and ...

Figures (10)

  • Figure 1: Comparison of validation loss and speedup for AdamW, single-worker DiLoCo, and GPA. Although setting the inner steps = 32 yields a lower final validation loss (see Figure fig:consolidated_valloss_vs_comm_intervals), setting the inner steps = 16 is faster in terms of the number of steps to attain the target validation loss (see Figure fig:bar_consolidated_valloss_vs_comm_intervals).
  • Figure 2: Comparison of DiLoCo and GPA's trajectories on a deterministic quadratic problem. The outer iterates of DiLoCo are shown as red points, and the inner iterates as thin red lines.
  • Figure 3: Comparison of the validation loss against the number of steps for different optimizers on the Llama-160M workload.
  • Figure 4: Comparison of AdamW and GPA on ImageNet ViT-S/16 from timm with data augmentations using a batch size of 4,096 samples. The optimal configurations for both AdamW and GPA use a learning rate of 0.005 and a weight decay of 0.1.
  • Figure 5: Comparison of AdamW and GPA on ImageNet ViT-S/16 from timm with data augmentations using a batch size of 16,384 samples.
  • ...and 5 more figures

Theorems & Definitions (12)

  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Lemma 1
  • Theorem 2
  • proof
  • ...and 2 more