OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
Sami Jaghouar, Jack Min Ong, Johannes Hagemann
TL;DR
OpenDiLoCo is an open-source framework for training large language models across globally distributed, bandwidth-constrained workers using the low-communication DiLoCo training method. It provides two implementations: a compact PyTorch reference for replication and a Hivemind-based version for real-world decentralized deployments. Both rest on a dual-optimizer local SGD scheme that uses two model copies to generate pseudo-gradients and supports FP16 all-reduce. The authors reproduce DiLoCo on a 150M-parameter model and scale the approach to 1.1B parameters, demonstrating comparable or improved perplexity with far less communication and achieving 90-95% compute utilization in a globally distributed setting. They also show that FP16 all-reduce is effective for pseudo-gradients, explore scalability through integration with PyTorch FSDP, and discuss opportunities for asynchronous training and future work on even larger models. Overall, OpenDiLoCo offers a practical, open-source path to globally distributed low-communication training, backed by empirical results and clear directions for scaling and efficiency improvements.
Abstract
OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models. We provide a reproducible implementation of the DiLoCo experiments and offer it within a scalable, decentralized training framework built on the Hivemind library. We demonstrate its effectiveness by training a model across two continents and three countries while maintaining 90-95% compute utilization. Additionally, we conduct ablation studies of the algorithm's compute efficiency and its scalability in the number of workers, and show that its gradients can be all-reduced in FP16 without any performance degradation. Furthermore, we scale OpenDiLoCo to 3x the size of the original work, demonstrating its effectiveness for billion-parameter models.
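
To make the training loop concrete, the following is a minimal, standard-library-only sketch of one way the scheme could look on a toy problem. It is not the OpenDiLoCo code. Several things are assumptions made for illustration: each worker minimizes a toy quadratic loss toward its own target vector, plain SGD stands in for the inner optimizer, the outer optimizer is SGD with Nesterov momentum, the all-reduce is simulated in a single process, and all function names and hyperparameter values (`diloco_round`, `train`, `outer_lr=0.7`, and so on) are hypothetical. FP16 communication is imitated by rounding each pseudo-gradient entry to half precision before averaging.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def run_inner_steps(global_params, target, lr, num_steps):
    """Local SGD on the toy loss 0.5 * ||params - target||^2 (assumed loss)."""
    local = list(global_params)  # second model copy; the global copy stays fixed
    for _ in range(num_steps):
        local = [p - lr * (p - t) for p, t in zip(local, target)]
    return local


def fp16_all_reduce_mean(vectors):
    """Simulate an averaging all-reduce whose payload is cast to FP16."""
    compressed = [[to_fp16(v) for v in vec] for vec in vectors]
    return [sum(column) / len(compressed) for column in zip(*compressed)]


def diloco_round(global_params, momentum, targets,
                 inner_lr=0.1, num_inner_steps=20, outer_lr=0.7, mu=0.9):
    """One outer step: local training, pseudo-gradient averaging, outer update."""
    pseudo_grads = []
    for target in targets:  # one loop iteration per simulated worker
        local = run_inner_steps(global_params, target, inner_lr, num_inner_steps)
        # Pseudo-gradient: starting (global) weights minus locally trained weights.
        pseudo_grads.append([g - l for g, l in zip(global_params, local)])

    avg = fp16_all_reduce_mean(pseudo_grads)

    # Outer optimizer (assumed): SGD with Nesterov momentum on the pseudo-gradient.
    momentum = [mu * m + g for m, g in zip(momentum, avg)]
    step = [g + mu * m for g, m in zip(avg, momentum)]
    new_params = [p - outer_lr * s for p, s in zip(global_params, step)]
    return new_params, momentum


def train(targets, num_rounds=50):
    """Run several outer rounds starting from all-zero parameters."""
    params = [0.0] * len(targets[0])
    momentum = [0.0] * len(params)
    for _ in range(num_rounds):
        params, momentum = diloco_round(params, momentum, targets)
    return params
```

In this sketch, workers exchange data only once per outer round rather than once per inner step, which is where the communication saving comes from. Calling `train([[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]])` drives the shared parameters toward the mean of the three targets, up to the small rounding error introduced by the simulated FP16 cast.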
