MuLoCo: Muon is a practical inner optimizer for DiLoCo
Benjamin Thérien, Xiaolong Huang, Irina Rish, Eugene Belilovsky
TL;DR
This work tackles the communication bottleneck in DiLoCo-based distributed LLM pre-training by introducing MuLoCo, which uses the Muon inner optimizer together with an error-feedback mechanism and aggressive compression. Through systematic ablations across inner optimizers and compression schemes, MuLoCo consistently outperforms DiLoCo, matching or exceeding data-parallel baselines while communicating up to $8\times$ less and maintaining identical memory usage. A key finding is that Muon's update structure enhances compressibility, enabling $2$-bit quantization with minimal degradation when paired with error feedback. The study demonstrates a practical pathway to scalable, low-communication pre-training of large transformers in data-center networks, with broad implications for efficiency in distributed deep learning. The results highlight the importance of error feedback in compressed updates and the potential of Muon as a viable inner optimizer for local-gradient-based distributed schemes.
Abstract
DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints, with advantages for increasing parallelism and accelerator utilization in data center settings. Although it significantly reduces communication frequency, DiLoCo's communication steps still involve all-reducing a complete copy of the model's parameters. While existing works have explored ways to reduce communication in DiLoCo, the role of error-feedback accumulators and the effect of the inner optimizer on compressibility remain under-explored. In this work, we investigate the effectiveness of standard compression methods, including Top-k sparsification and quantization, for reducing the communication overhead of DiLoCo when paired with two local optimizers (AdamW and Muon). Our experiments on pre-training decoder-only transformer language models (LMs) reveal that using Muon as the inner optimizer for DiLoCo, together with an error-feedback accumulator, allows the communicated delta to be aggressively compressed to 2 bits with next to no performance degradation. Crucially, MuLoCo (Muon inner optimizer DiLoCo) significantly outperforms DiLoCo while communicating $8\times$ less and having identical memory complexity.
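
To make the interplay between compression and error feedback concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the textbook form of error feedback, in which the residual left over after compression is stored and added to the next communicated delta. The function names (`quantize_2bit`, `top_k`, `compress_with_error_feedback`) and the uniform min-max quantization grid are illustrative assumptions; the exact quantizer and accumulator used in the paper may differ.

```python
def quantize_2bit(values):
    """Assumed quantizer: snap each value to one of 4 evenly spaced levels
    between min(values) and max(values), then return the dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / 3  # 2 bits -> 4 levels -> 3 intervals
    return [lo + round((v - lo) / step) * step for v in values]


def top_k(values, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def compress_with_error_feedback(delta, error, compressor):
    """One communication step with an error-feedback accumulator.

    delta:      the update a worker wants to communicate this round
    error:      residual carried over from previous rounds (same length)
    compressor: any lossy map, e.g. quantize_2bit or a top_k closure
    Returns (sent, new_error), where new_error is what compression dropped.
    """
    corrected = [d + e for d, e in zip(delta, error)]
    sent = compressor(corrected)
    new_error = [c - s for c, s in zip(corrected, sent)]
    return sent, new_error


# Hypothetical usage over two rounds with a toy 8-element delta.
delta = [0.5, -1.0, 0.25, 2.0, -0.1, 0.0, 1.5, -0.75]
error = [0.0] * len(delta)
sent, error = compress_with_error_feedback(delta, error, quantize_2bit)
sent, error = compress_with_error_feedback(delta, error, lambda v: top_k(v, 2))
```

The property worth noting is that `sent + new_error` always equals `delta + error`, so information discarded by the compressor is deferred to a later round rather than lost. The accumulator is one extra buffer the size of the communicated delta.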
