Eager Updates For Overlapped Communication and Computation in DiLoCo
Satyen Kale, Arthur Douillard, Yanislav Donchev
TL;DR
The paper targets the stalls caused by outer-gradient synchronization in DiLoCo when workers are connected by low-bandwidth cross-datacenter links, and introduces eager updates that overlap this communication with computation. Each worker mixes its current local outer gradient with the delayed outer gradients from the other workers to form a fresher proxy, so the outer update can be applied before the all-reduce completes while the delayed contributions are still incorporated once they arrive. Empirically, naively delaying the outer gradient degrades convergence, whereas the eager variant reaches competitive training loss with markedly higher compute utilization, scales from tens of millions to billions of parameters, and reduces bandwidth requirements. The approach offers practical gains for distributed training of large language models in bandwidth-constrained environments and suggests further exploration of convergence theory and partial-step overlap strategies.
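To make the mixing step concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the outer gradients of M workers are averaged with equal weight 1/M, so a worker can form an eager proxy by taking the delayed average from the previous round and swapping its own stale contribution for its fresh one. The function name, argument names, and the toy numbers are hypothetical.

```python
def eager_outer_gradient(local_current, local_previous, delayed_average, num_workers):
    """Sketch of an eager proxy for the averaged outer gradient.

    Assumption: equal 1/M weighting across workers. The delayed average already
    contains this worker's previous outer gradient, so that stale share is
    removed and replaced with the fresh local outer gradient.
    """
    return [
        avg - prev / num_workers + cur / num_workers
        for cur, prev, avg in zip(local_current, local_previous, delayed_average)
    ]


# Toy example with 2 workers, seen from worker 0.
previous = [[1.0, 2.0], [3.0, 4.0]]  # last round's outer gradients, one row per worker
delayed_average = [sum(col) / 2 for col in zip(*previous)]  # result of the late all-reduce
fresh_local = [5.0, 6.0]  # worker 0's outer gradient from the round that just finished

proxy = eager_outer_gradient(fresh_local, previous[0], delayed_average, num_workers=2)
```

In this toy case the proxy is simply the average of worker 0's fresh outer gradient and worker 1's outer gradient from the previous round, which is the "current local plus delayed non-local" mix described above.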
Abstract
Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase, where the workers independently execute multiple optimization steps on their own local data, and an outer optimization step, where the inner updates are synchronized. While such approaches require orders of magnitude less communication than standard data-parallel training, in settings where the workers are datacenters, even the limited communication requirements of these approaches can still cause significant slowdowns due to the blocking necessary at each outer optimization step. In this paper, we investigate techniques to mitigate this issue by overlapping communication with computation in a manner that allows the outer optimization step to fully overlap with the inner optimization phase. We show that a particular variant, dubbed eager updates, performs competitively with standard DiLoCo in settings with low bandwidth between workers.
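The inner/outer structure described above can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: each worker minimizes a scalar quadratic loss, and both the inner and outer optimizers are plain gradient descent, swapped in here for brevity rather than taken from the paper. All names and hyperparameters are hypothetical. The averaging line marks where a real system would block on the all-reduce, which is the step the paper aims to overlap with computation.

```python
def inner_phase(theta, target, steps, lr):
    """Run several local gradient steps on the toy loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        theta = theta - lr * (theta - target)
    return theta


def outer_round(theta_global, targets, inner_steps, inner_lr, outer_lr):
    """One round: independent inner phases, then a synchronized outer step."""
    outer_grads = []
    for target in targets:  # each target stands in for one worker's local data
        theta_local = inner_phase(theta_global, target, inner_steps, inner_lr)
        # Outer gradient: how far this worker moved away from the shared parameters.
        outer_grads.append(theta_global - theta_local)
    # In a real deployment this average is a blocking all-reduce across workers.
    avg_outer_grad = sum(outer_grads) / len(outer_grads)
    return theta_global - outer_lr * avg_outer_grad


theta = 0.0
for _ in range(50):
    theta = outer_round(theta, targets=[1.0, 3.0], inner_steps=5, inner_lr=0.1, outer_lr=1.0)
```

With two workers whose local optima are 1.0 and 3.0, the shared parameter drifts toward their mean, even though the workers only communicate once every five inner steps.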
