MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates

Alex Iacob; Andrej Jovanovic; Mher Safaryan; Meghdad Kurmanji; Lorenzo Sani; Samuel Horváth; William F. Shen; Xinchi Qiu; Nicholas D. Lane

TL;DR

MT-DAO introduces a multi-timescale optimization framework that uses slow and fast momentum components to stabilize and guide updates between infrequent synchronizations in distributed training. By combining multiple first-moment buffers with quasi-hyperbolic momentum, MT-DAO preserves trajectory memory across communication rounds while remaining responsive to loss dynamics, and it provides convergence guarantees for SGDM-based variants. Empirically, MT-DAO closes the perplexity gap with fully synchronous DDP at language-model scales up to 720M parameters, reduces wall-clock time by 6–27%, and communicates roughly 10× less than DDP. The results indicate that aligning optimizer momentum timescales with communication intervals yields more robust, scalable distributed training and enables effective cross-datacenter and geo-distributed pretraining.

Abstract

Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.
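
The time-scale mismatch described above can be made concrete with a small calculation. The sketch below is illustrative only and is not the authors' implementation: the function names, the scalar setting, and the particular decay rates and mixing weights are assumptions chosen for the example. It first computes the half-life of an exponential moving average, which is about 6.6 steps for a decay of 0.9 and about 693 steps for a decay of 0.999, and then shows one quasi-hyperbolic-style step that mixes the raw gradient with a fast and a slow momentum buffer.

```python
import math


def ema_half_life(beta: float) -> float:
    """Number of steps after which a gradient's weight in an
    exponential moving average with decay `beta` has halved."""
    return math.log(0.5) / math.log(beta)


def qh_multi_momentum_step(grad, momenta, betas, weights):
    """One quasi-hyperbolic-style step on scalars (hypothetical helper).

    Each buffer j is an EMA of the gradient with its own decay betas[j].
    The update direction is a convex mix of the raw gradient and the
    buffers; `weights` are assumed non-negative with sum <= 1.
    """
    # Update every momentum buffer with its own timescale.
    new_momenta = [b * m + (1.0 - b) * grad for b, m in zip(betas, momenta)]
    # The gradient gets whatever weight the buffers do not use.
    grad_weight = 1.0 - sum(weights)
    direction = grad_weight * grad + sum(
        w * m for w, m in zip(weights, new_momenta)
    )
    return direction, new_momenta


if __name__ == "__main__":
    # Illustrative decay rates: a fast and a slow buffer.
    for beta in (0.9, 0.999):
        print(f"beta={beta}: half-life = {ema_half_life(beta):.1f} steps")

    direction, momenta = qh_multi_momentum_step(
        grad=1.0, momenta=[0.0, 0.0], betas=[0.9, 0.999], weights=[0.5, 0.4]
    )
    print(direction, momenta)
```

With a synchronization interval of a few dozen local steps (an illustrative figure, not one taken from the paper), a buffer with decay 0.9 has forgotten most of what it held at the previous synchronization, whereas a buffer with decay 0.999 retains most of it. This is the intuition behind pairing slow and fast buffers.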
