MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates

Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane

TL;DR

MT-DAO introduces a multi-timescale optimization framework that employs slow and fast momentum components to stabilize and guide updates across infrequent synchronization in distributed training. By combining multiple first-moment buffers with quasi-hyperbolic momentum, MT-DAO preserves trajectory memory across communication rounds while remaining responsive to loss dynamics, and it provides convergence guarantees for SGDM-based variants. Empirically, MT-DAO closes the perplexity gap with fully synchronous DDP across language-model scales up to 720M parameters, reduces wall-clock time by 6–27%, and achieves substantial communication savings (roughly 10× less than DDP). The results demonstrate that aligning optimizer momentum timescales with communication intervals yields more robust, scalable distributed training and enables effective cross-datacenter and geo-distributed model pretraining.

Abstract

Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.
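
As a rough illustration of the idea in the abstract, the following Python sketch mixes several momentum buffers with the raw gradient in a quasi-hyperbolic form and averages worker state after every K local steps. The update rule, the choice to average the momenta together with the parameters, and all names (`qh_direction`, `local_round`, `betas`, `omegas`) are assumptions made for illustration. This excerpt does not include the paper's algorithm listings, and the sketch covers only a plain SGDM-style variant without second-moment adaptation.

```python
import numpy as np

def qh_direction(grad, momenta, betas, omegas):
    """Update each momentum buffer in place and return the mixed direction.

    Assumed form: u = (1 - sum(omegas)) * grad + sum_j omegas[j] * momenta[j].
    """
    for j, beta in enumerate(betas):
        # Exponential moving average of the gradient with decay beta.
        momenta[j] = beta * momenta[j] + (1.0 - beta) * grad
    grad_weight = 1.0 - sum(omegas)
    return grad_weight * grad + sum(w * m for w, m in zip(omegas, momenta))

def local_round(xs, ms, grad_fn, lr, betas, omegas, K, rng):
    """Run K local steps on every worker, then average parameters and momenta.

    xs: list of per-worker parameter arrays (modified in place).
    ms: list of per-worker lists of momentum arrays (modified in place).
    """
    for x, m in zip(xs, ms):
        for _ in range(K):
            g = grad_fn(x, rng)
            x -= lr * qh_direction(g, m, betas, omegas)
    # Infrequent synchronization: one averaging step per round.
    x_avg = np.mean(xs, axis=0)
    m_avg = [np.mean([m[j] for m in ms], axis=0) for j in range(len(betas))]
    for i in range(len(xs)):
        xs[i][...] = x_avg
        for j in range(len(betas)):
            ms[i][j] = m_avg[j].copy()
    return x_avg
```

In this sketch, `omegas=[1.0]` reduces the direction to a single standard momentum buffer, while smaller weights let the current gradient contribute directly alongside a slowly decaying buffer.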

Paper Structure

This paper contains 38 sections, 4 theorems, 57 equations, 11 figures, 3 tables, 4 algorithms.

Key Result

Theorem 1

Let Assumptions ass:smooth, ass:boundgrad and ass:het hold. Then, choosing the step size $\eta = \min(\eta_0, \frac{1}{\sqrt{T}})$ where $\eta_0 \stackrel{\mathrm{def}}{=} 1/( 4L \max(\beta_{\omega}, 6\sqrt{\psi\max(1,B^2-1)}) )$ with constants the average iterates $x_t = \mathbb{E}_m[x_t^m]$ of MT-DAO-SGDM converge with the following rate:

Figures (11)

  • Figure 1: To highlight the stability benefit of MT-DAO, we illustrate its performance on a toy non-convex problem. Crucially, under a high momentum decay of $\beta=0.9999$, prior stateful methods like Local Adam become unstable and fail to converge, whereas MT-DAO maintains its rapid and stable convergence. We optimize the non-convex Rosenbrock function $f(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2$ with $M=256$ workers and IID Gaussian noise ($\sigma=2$).
  • Figure 2: Comparison of Local SGDM with standard momentum (top) and MT-DAO-SGDM ($N=1$ momentum, $\omega_1=0.95$) (bottom) for the function $f(x;\lambda)=\frac{1}{2}\lambda x^2$ with $x \in \mathbb{R}$, for various values of the rate-of-change parameter $\lambda$ and sync frequencies (frequent: solid, infrequent/slow: dashed). While both optimizers are stable at low momentum ($\beta=0.9$), at high momentum ($\beta=0.999$) Local SGDM with standard momentum becomes unstable for high $\lambda$ while MT-DAO-SGDM remains stable.
  • Figure 3: A comparison of MT-DAO ($\beta_1=0.999$) versus a LocalADOPT baseline ($\beta_1=0.95$) with a communication frequency of $K=32$. For each communication round, we plot metrics computed between the momentum at the start ($t$) and end ($t+K$) of the round. MT-DAO's slow momentum preserves mutual information, $I(U_{t}; U_{t+K})$, across rounds, while the baseline's momentum decays and loses the global optimization direction (left). Furthermore, MT-DAO reduces inter-worker momentum variance, $\text{Var}(u_{t+K})$, indicating greater stability against local noise (right).
  • Figure 4: Mean relative L2 change and standard deviation across communication rounds of (left) model parameters and (right) the first momentum state, as a function of momentum decay ($\beta_1$) and weight ($\omega_1$). In both cases, MT-DAO shows a significantly reduced relative rate of change with high $(\beta_1, \omega_1)$ (minimum in gold), which reduces worker drift and thus makes parameter averaging more effective. Each point on the grid corresponds to a configuration evaluated with its own independently tuned learning rate. LocalADOPT corresponds to ($\beta_1=0.95,\omega_1=1.0)$.
  • Figure 5: Validation perplexity versus wall-clock time and training tokens for MT-DAO and baselines on models of size (a) 16M, (b) 125M, and (c) 720M. Horizontal lines denote the two DDP baselines (ADOPT-DDP and QHADOPT-DDP). For each non-DDP method, a colored marker on the x-axis marks the earliest point at which its curve attains a perplexity lower than or equal to that of a DDP variant.
  • ...and 6 more figures
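
The decay values quoted in the Figure 3 caption can be turned into a quick back-of-the-envelope check of the time-scale mismatch. For an exponential moving average with decay $\beta$, the state from $K$ steps earlier carries weight $\beta^K$. The helper names below are hypothetical, and this is a generic property of exponential averages, not a reproduction of the mutual-information measurement in the figure.

```python
import math

def retained_weight(beta, K):
    """Weight that an EMA with decay beta still places on its state from K steps ago."""
    return beta ** K

def half_life(beta):
    """Number of steps after which the weight on an old state drops to one half."""
    return math.log(0.5) / math.log(beta)

K = 32  # communication interval quoted in the Figure 3 caption
for beta in (0.95, 0.999):
    print(f"beta={beta}: retained after K={K} steps = {retained_weight(beta, K):.3f}, "
          f"half-life = {half_life(beta):.1f} steps")
```

With $K=32$, a buffer at $\beta=0.95$ retains roughly 19% of its start-of-round state, while one at $\beta=0.999$ retains roughly 97%, which gives a sense of why a slower buffer can carry direction across a full communication round.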

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof