Eager Updates For Overlapped Communication and Computation in DiLoCo

Satyen Kale, Arthur Douillard, Yanislav Donchev

TL;DR

The paper addresses inefficiencies from outer-gradient synchronization in DiLoCo under low-bandwidth cross-datacenter settings by introducing eager updates that overlap communication with computation. The method mixes current local outer gradients with delayed non-local gradients to produce a fresher proxy, enabling outer updates before all-reduce completion while still incorporating delayed updates when they arrive. Empirical results show that naive delaying degrades convergence, whereas the eager variant achieves competitive training loss and markedly higher compute utilization, scaling well from tens of millions to billions of parameters and reducing bandwidth requirements. This approach offers practical gains for distributed training of large language models in bandwidth-constrained environments and suggests further exploration of convergence theory and partial-step overlap strategies.
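The mixing step described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the exact formula (swap the worker's own stale contribution in the delayed average for its fresh one) and every name used (`eager_outer_gradient`, `local_now`, `local_prev`, `avg_prev`, `num_workers`) are assumptions made for the example, since the summary only states that current local and delayed non-local outer gradients are mixed.

```python
def naive_delayed_outer_gradient(avg_prev):
    """Naive delay: apply the previous round's averaged outer gradient as-is."""
    return list(avg_prev)


def eager_outer_gradient(local_now, local_prev, avg_prev, num_workers):
    """Assumed eager proxy for the current averaged outer gradient.

    local_now   -- this worker's outer gradient from the round that just ended
    local_prev  -- this worker's outer gradient from the previous round
    avg_prev    -- previous round's averaged outer gradient (all-reduce finished)
    num_workers -- number of workers taking part in the average
    """
    # Replace this worker's stale share of the delayed average with its fresh
    # share; the other workers' shares stay one round old.
    return [
        avg + (now - prev) / num_workers
        for now, prev, avg in zip(local_now, local_prev, avg_prev)
    ]


# Two workers: previous outer gradients [1, 2] and [3, 4] average to [2, 3].
proxy = eager_outer_gradient(
    local_now=[2.0, 2.0], local_prev=[1.0, 2.0], avg_prev=[2.0, 3.0], num_workers=2
)
```

Under this assumed form, the proxy equals the delayed average whenever the local outer gradient has not changed, and differs from it only by the worker's own fresh contribution.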

Abstract

Distributed optimization methods such as DiLoCo have been shown to be effective in training very large models across multiple distributed workers, such as datacenters. These methods split updates into two parts: an inner optimization phase, where the workers independently execute multiple optimization steps on their own local data, and an outer optimization step, where the inner updates are synchronized. While such approaches require orders of magnitude less communication than standard data-parallel training, in settings where the workers are datacenters, even the limited communication requirements of these approaches can still cause significant slowdowns due to the blocking necessary at each outer optimization step. In this paper, we investigate techniques to mitigate this issue by overlapping communication with computation in a manner that allows the outer optimization step to fully overlap with the inner optimization phase. We show that a particular variant, dubbed eager updates, provides competitive performance with standard DiLoCo in settings with low bandwidth between workers.
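The two-part structure described in the abstract can be sketched on a toy problem. This is a hedged illustration only: the scalar quadratic loss, the plain gradient-descent inner and outer optimizers, and the names `inner_phase` and `diloco_round` are assumptions chosen to keep the example self-contained, and the in-process average stands in for the blocking cross-worker all-reduce.

```python
def inner_phase(theta, target, steps, lr):
    """Run `steps` local gradient steps on the toy loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        theta -= lr * (theta - target)
    return theta


def diloco_round(theta, targets, steps, inner_lr, outer_lr):
    """One outer round: independent inner phases, then a synchronized outer step."""
    # Inner phase: every worker starts from the shared parameters and trains
    # on its own data (here, its own target) without communicating.
    local_params = [inner_phase(theta, t, steps, inner_lr) for t in targets]
    # Outer gradient of a worker: shared starting point minus its local result.
    outer_grads = [theta - w for w in local_params]
    # Stand-in for the blocking all-reduce that averages outer gradients.
    avg_outer_grad = sum(outer_grads) / len(outer_grads)
    # Outer step (plain gradient descent assumed here for simplicity).
    return theta - outer_lr * avg_outer_grad


new_theta = diloco_round(theta=0.0, targets=[1.0, 3.0], steps=1, inner_lr=1.0, outer_lr=1.0)
```

In this sketch the averaging line is the point where every worker would wait for the slowest link, which is the blocking cost the paper sets out to hide behind the next inner phase.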

Paper Structure

This paper contains 20 sections, 7 figures, 6 tables, 3 algorithms.

Figures (7)

  • Figure 1: Data flow and operations in standard DiLoCo. Here, 4 workers execute in parallel and alternate sequentially between computation (the outer and inner optimization steps) and communication (averaging outer gradients across workers).
  • Figure 2: Data flow and operations in DiLoCo with delayed outer gradients. Here, 4 workers execute optimization steps in parallel with each other, as well as with the communication required for averaging outer gradients. This is accomplished by delaying the application of the averaged outer gradient in the outer optimizer.
  • Figure 3: Compute utilization simulated across a range of bandwidths. A compute utilization of 0.8 means 80% of the time is spent in computation and 20% in communication. Our best method reaches a compute utilization of 95% for 1B, 10B, and 100B models at a roughly constant bandwidth between 1 and 5 Gbit/s. Data-Parallel, on the other hand, requires 100, 200, and 300 Gbit/s, respectively.
  • Figure 4: Scaling models from 35M (1.49e17 flops) to 1B parameters (1.9e20 flops) on C4.
  • Figure 5: Comparison of overlapping communication over an outer step, using the naïve delayed version (\ref{alg:naïve-delayed}) and the eager version (\ref{alg:eager-delayed}) when varying the number of inner steps $H$.
  • ...and 2 more figures
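
The compute-utilization metric defined in the Figure 3 caption can be written as a one-line ratio. This is a minimal sketch of the definition only, not the paper's simulator; the function name and the two timing arguments are hypothetical, and only communication time that is not hidden behind computation is counted.

```python
def compute_utilization(compute_seconds, blocking_comm_seconds):
    """Fraction of wall-clock time spent computing rather than waiting on communication."""
    return compute_seconds / (compute_seconds + blocking_comm_seconds)


# 8 s of computation and 2 s of exposed communication give a utilization of 0.8.
utilization = compute_utilization(8.0, 2.0)
```

Communication that fully overlaps with the inner phase contributes zero blocking time, so the ratio goes to 1.0.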