OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

Sami Jaghouar, Jack Min Ong, Johannes Hagemann

TL;DR

OpenDiLoCo addresses the challenge of training large language models across globally distributed, bandwidth-constrained hardware by implementing the low-communication DiLoCo training method in an open-source framework. It provides both a compact PyTorch reference and a Hivemind-based implementation, enabling replication as well as real-world decentralized deployments. Both rest on a dual-optimizer local SGD scheme that keeps two model copies to generate pseudo-gradients and supports FP16 all-reduce. The authors reproduce DiLoCo on a 150M-parameter model and scale the approach to 1.1B parameters, demonstrating comparable or improved perplexity with far lower communication and achieving 90-95% compute utilization in a globally distributed setting. They also show that FP16 all-reduce is effective for pseudo-gradients, explore scalability via integration with PyTorch FSDP, and discuss asynchronous training and larger models as future work. Overall, OpenDiLoCo establishes a practical, open-source pathway for globally distributed low-communication training, with empirical support and clear directions for scaling and efficiency improvements.

Abstract

OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments, offering it within a scalable, decentralized training framework using the Hivemind library. We demonstrate its effectiveness by training a model across two continents and three countries, while maintaining 90-95% compute utilization. Additionally, we conduct ablation studies on the algorithm's compute efficiency and its scalability in the number of workers, and show that its gradients can be all-reduced using FP16 without any performance degradation. Furthermore, we scale OpenDiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion-parameter models.
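
The dual-optimizer scheme summarized in the TL;DR can be illustrated with a toy sketch. This is not the authors' PyTorch or Hivemind code: it is a minimal pure-Python illustration under stated assumptions. The inner optimizer is plain SGD on a quadratic toy loss, the outer optimizer is SGD with plain momentum, the all-reduce is simulated by an element-wise mean, and all function names, learning rates, and targets (`to_fp16`, `inner_steps`, `outer_step`, `train`) are hypothetical choices made for illustration.

```python
import struct


def to_fp16(x):
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def inner_steps(params, target, lr, num_steps):
    """Local training on the toy loss 0.5 * sum((p - t)**2), using plain SGD."""
    local = list(params)  # second model copy; the global copy stays untouched
    for _ in range(num_steps):
        local = [p - lr * (p - t) for p, t in zip(local, target)]
    return local


def outer_step(global_params, worker_params, momentum_buf, outer_lr, momentum):
    """Average FP16 pseudo-gradients and apply one momentum-SGD outer update."""
    # Pseudo-gradient of each worker: global weights minus locally trained
    # weights, cast to FP16 to mimic a half-precision all-reduce payload.
    pseudo_grads = [
        [to_fp16(g - w) for g, w in zip(global_params, local)]
        for local in worker_params
    ]
    # Stand-in for the all-reduce: element-wise mean across workers.
    avg = [sum(col) / len(worker_params) for col in zip(*pseudo_grads)]
    new_buf = [momentum * b + d for b, d in zip(momentum_buf, avg)]
    new_global = [g - outer_lr * b for g, b in zip(global_params, new_buf)]
    return new_global, new_buf


def train(rounds, local_steps=5):
    """Two hypothetical workers whose data pull toward different targets."""
    worker_targets = [[1.0, 2.0], [3.0, 2.0]]
    global_params = [0.0, 0.0]
    momentum_buf = [0.0, 0.0]
    for _ in range(rounds):
        # Each worker starts from the same global weights and trains locally.
        worker_params = [
            inner_steps(global_params, t, lr=0.1, num_steps=local_steps)
            for t in worker_targets
        ]
        # Workers communicate once per round, not once per inner step.
        global_params, momentum_buf = outer_step(
            global_params, worker_params, momentum_buf, outer_lr=0.7, momentum=0.5
        )
    return global_params
```

Calling `train(100)` drives the global parameters close to the mean of the two worker targets, even though the workers exchange only FP16 pseudo-gradients once every `local_steps` inner steps. In a real setting the inner loop would run an actual optimizer on model weights and the mean would be a network all-reduce.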

Paper Structure

This paper contains 15 sections, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Pseudo-Code for Outer Optimizer in OpenDiLoCo.
  • Figure 2: OpenDiLoCo - Hivemind API.
  • Figure 3: Main result: 150 million parameter Llama model pre-training with 8 DiLoCo workers yields significantly lower perplexity than the baseline without DiLoCo, and even compared to the baseline using 8 times larger batch size with the same compute budget, while communicating 500 times less.
  • Figure 4: Ablation Study on the Number of Workers in DiLoCo: Performance comparison of DiLoCo with different numbers of workers and 50 local steps against the baseline without DiLoCo. Due to compute constraints, these ablation experiments were not extended to $88{,}000$ steps like the other experiments.
  • Figure 5: Ablation Study on FLOP Efficiency Relative to Number of Workers in DiLoCo: This figure compares the performance of DiLoCo with different numbers of workers and 50 local steps against the baseline without DiLoCo. The x-axis shows the global steps instead of local steps, providing a better approximation of DiLoCo's FLOP efficiency by comparing the total amount of compute spent on the model.
  • ...and 3 more figures
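
The communication saving quoted in the Figure 3 caption can be checked with simple arithmetic. The sketch below assumes that "communicating 500 times less" corresponds to one synchronization every 500 local steps, and it reuses the 88,000-step horizon mentioned in the Figure 4 caption. Both the mapping and the helper name `sync_rounds` are assumptions for illustration, not values taken from the authors' code.

```python
def sync_rounds(total_steps, local_steps_per_sync):
    """Number of synchronization rounds when syncing every N local steps."""
    return total_steps // local_steps_per_sync


TOTAL_STEPS = 88_000  # step horizon mentioned in the Figure 4 caption
LOCAL_STEPS = 500     # assumed sync interval implied by "500 times less"

every_step = sync_rounds(TOTAL_STEPS, 1)              # sync after every step
low_comm = sync_rounds(TOTAL_STEPS, LOCAL_STEPS)      # sync every 500 steps
reduction = every_step // low_comm

print(every_step, low_comm, reduction)
```

Under these assumptions, a run that would otherwise synchronize 88,000 times needs only 176 synchronization rounds.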