Table of Contents
Fetching ...

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham

TL;DR

Streaming DiLoCo introduces streaming partial outer-gradient synchronization, overlap between communication and computation, and 4-bit quantization to enable bandwidth-efficient, scalable distributed training of billion-parameter LLMs. By partitioning outer-gradient exchanges into fragments and overlapping them with inner optimization, the method maintains learning quality while reducing peak bandwidth by orders of magnitude and enabling communication latency to be hidden within computation. Ablations show the approach is robust to fragment choices and asynchronous slack, while experiments on Chinchilla-scale models and Dolma demonstrate competitive performance with dramatically less data movement. This work advances toward a distributed free lunch for large-scale training across co-located and heterogeneous hardware.

Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into ``workers'', where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.

Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

TL;DR

Streaming DiLoCo introduces streaming partial outer-gradient synchronization, overlap between communication and computation, and 4-bit quantization to enable bandwidth-efficient, scalable distributed training of billion-parameter LLMs. By partitioning outer-gradient exchanges into fragments and overlapping them with inner optimization, the method maintains learning quality while reducing peak bandwidth by orders of magnitude and enabling communication latency to be hidden within computation. Ablations show the approach is robust to fragment choices and asynchronous slack, while experiments on Chinchilla-scale models and Dolma demonstrate competitive performance with dramatically less data movement. This work advances toward a distributed free lunch for large-scale training across co-located and heterogeneous hardware.

Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into ``workers'', where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.

Paper Structure

This paper contains 40 sections, 18 figures, 7 tables, 2 algorithms.

Figures (18)

  • Figure 1: Streaming DiLoCo: each replica trains independently for dozen of inner optimization steps, and then synchronize a single fragment during outer optimization. In this figure, there are $M=4$ replicas with $p=\{1,2,3\}$ fragments. Each fragment can be made of several transformer layers. Note that this figure only showcases the streaming partial updates (\ref{['sec:model_streaming']}) and not the quantized communication overlapping (subsection \ref{['sec:model_overlapping']} and \ref{['sec:model_low_precision']}).
  • Figure 2: Streaming pattern: sequential (left) and strided (right). Colors denotes the fragment. A different fragment is synchronized each time.
  • Figure 3: Simulation of a schedule interleaving forward passes (in blue), backward passes w.r.t. activations and parameters (resp. in light and dark green), and (outer) gradient reduction (in purple).
  • Figure 4: Compute Utilization simulated across a range of bandwidth. A compute utilization of 0.8 means 80% of the time is spent in computation, and 20% in communication. Our best method reaches a compute utilization of 95% for models 1B, 10B, and 100B with a bandwidth roughly constant between 1 and 5 Gbit/s. Data-Parallel on the other hand requires 100, 200, and 300Gbit/s.
  • Figure 5: Scaling models from 35M (1.49e17 flops) to 4B parameters (2e21 flops) on C4.
  • ...and 13 more figures

Theorems & Definitions (1)

  • Remark