MuLoCo: Muon is a practical inner optimizer for DiLoCo

Benjamin Thérien, Xiaolong Huang, Irina Rish, Eugene Belilovsky

TL;DR

This work tackles the communication bottleneck in DiLoCo-based distributed LLM pre-training by introducing MuLoCo, which combines the Muon inner optimizer with an error-feedback mechanism and aggressive compression. In systematic ablations across inner optimizers and compression schemes, MuLoCo consistently outperforms DiLoCo and matches or exceeds data-parallel baselines while communicating up to $8\times$ less at identical memory usage. A key finding is that Muon's update structure enhances compressibility, enabling $2$-bit quantization with minimal degradation when paired with error feedback. The study demonstrates a practical pathway to scalable, low-communication pre-training of large transformers in data-center networks. The results highlight the importance of error feedback for compressed updates and the potential of Muon as an inner optimizer for local-gradient-based distributed schemes.
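The $8\times$ figure follows from simple payload arithmetic. The sketch below assumes the 220M-parameter model and the 16-bit uncompressed baseline reported in the figure captions, counts one value per parameter per communication step, and ignores quantizer metadata and all-reduce implementation details. The helper name `payload_bits` is hypothetical.

```python
def payload_bits(num_params, bits_per_value):
    """Bits needed to send one value per parameter (metadata ignored)."""
    return num_params * bits_per_value

num_params = 220_000_000                    # model size from the Figure 1 caption
baseline = payload_bits(num_params, 16)     # uncompressed 16-bit delta
quantized = payload_bits(num_params, 2)     # 2-bit quantized delta

ratio = baseline / quantized
print(f"16-bit payload: {baseline / 8 / 1e6:.0f} MB")   # 440 MB
print(f" 2-bit payload: {quantized / 8 / 1e6:.0f} MB")  # 55 MB
print(f"reduction: {ratio:.0f}x")                       # 8x
```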

Abstract

DiLoCo is a powerful framework for training large language models (LLMs) under networking constraints, with advantages for increasing parallelism and accelerator utilization in data center settings. Despite significantly reducing communication frequency, however, DiLoCo's communication steps still involve all-reducing a complete copy of the model's parameters. While existing works have explored ways to reduce communication in DiLoCo, the role of error feedback accumulators and the effect of the inner optimizer on compressibility remain under-explored. In this work, we investigate the effectiveness of standard compression methods, including Top-k sparsification and quantization, for reducing the communication overhead of DiLoCo when paired with two local optimizers (AdamW and Muon). Our experiments on pre-training decoder-only transformer language models (LMs) reveal that using Muon as the inner optimizer for DiLoCo, along with an error-feedback accumulator, allows the communicated delta to be aggressively compressed to 2 bits with next to no performance degradation. Crucially, MuLoCo (Muon inner optimizer DiLoCo) significantly outperforms DiLoCo while communicating $8\times$ less and having identical memory complexity.
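To illustrate how an error-feedback accumulator interacts with low-bit quantization, here is a minimal sketch in plain Python. It is not the authors' implementation: the uniform min-max quantizer, the function names, and the way `beta` scales the accumulator are all assumptions made for illustration. The idea is that whatever the quantizer discards in one communication step is stored locally and added back before the next compression.

```python
import random

def quantize_uniform(values, bits=2):
    """Round each value to one of 2**bits evenly spaced levels in [min, max].
    (Assumed quantizer; the paper's exact scheme may differ.)"""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]

def compress_with_error_feedback(delta, error, beta=1.0, bits=2):
    """Quantize delta + beta * error and return (message, new_error).
    `beta` is an assumed decay on the accumulator; beta=1 keeps all residual."""
    corrected = [d + beta * e for d, e in zip(delta, error)]
    message = quantize_uniform(corrected, bits)
    new_error = [c - m for c, m in zip(corrected, message)]
    return message, new_error

# Simulate one worker over several communication steps with random deltas.
random.seed(0)
dim, rounds = 16, 50
error = [0.0] * dim
sum_delta = [0.0] * dim
sum_sent = [0.0] * dim
for _ in range(rounds):
    delta = [random.gauss(0.0, 1.0) for _ in range(dim)]
    message, error = compress_with_error_feedback(delta, error)
    sum_delta = [s + d for s, d in zip(sum_delta, delta)]
    sum_sent = [s + m for s, m in zip(sum_sent, message)]

# With beta = 1, everything not yet sent sits in the accumulator, so
# (total sent) + (current residual) equals (total of the true deltas).
gap = max(abs(sd - (ss + e)) for sd, ss, e in zip(sum_delta, sum_sent, error))
print(f"max bookkeeping gap: {gap:.2e}")
```

With `beta=1.0` the quantization error is never lost, only delayed, which is one intuition for why error feedback can help under aggressive compression.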

Paper Structure

This paper contains 15 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: MuLoCo (Muon inner optimizer) with 2-bit quantization and error feedback outperforms standard AdamW-DiLoCo with $8\times$ less communication while having identical memory complexity. The figure reports the test loss ($y$-axis) measured at each communication step during pre-training. The $x$-axis reports the total number of bits communicated for a 220M parameter transformer LM. We use $K=8$ workers and $H=30$ local steps.
  • Figure 2: Error feedback ablation for MuLoCo vs. DiLoCo with compressed updates. We vary the strength of (LEFT) Top-$k$ sparsification ($1-50\%$) and (RIGHT) quantization ($2$, $4$, and $8$ bits). EF designates runs using error feedback. The dashed lines report the final performance of the no-compression baseline (16-bit floats). We observe that EF consistently improves performance and that MuLoCo's advantage over DiLoCo grows as quantization becomes more aggressive.
  • Figure 3: Comparison to Data Parallel baselines. We report the test loss at communication steps during training of MuLoCo and DiLoCo compared to their corresponding Data Parallel baselines (dashed lines). We observe that MuLoCo matches its data-parallel baseline and that DiLoCo nearly matches its data-parallel baseline.
  • Figure 4: Top-$k$ vs. DCT Top-$k$ sparsification and random sparsification results. We use DCT compression with a chunk size $s=128$. We observe that DCT compression slightly improves performance for both MuLoCo and DiLoCo at $10\%$ and $20\%$ compared to Top-$k$. At $5\%$, MuLoCo's performance degrades while DiLoCo still performs well. We observe that Random-$k$ compression leads to rapid performance deterioration.
  • Figure 5: Training curves for Muon DiLoCo with different training curves. We report the training loss for variants of DiLoCo with different inner optimizers. All hyperparameters were tuned on a setup without any compression, except the error feedback $\beta$ parameter. Any non-visible curves that appear in the legend either overlap with another curve or do not reach low enough loss to be seen.
  • ...and 2 more figures
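
The Top-$k$ sparsification referenced in the captions above can be sketched as follows. This is a generic magnitude-based Top-$k$ operator written for illustration, not code from the paper; the function name and the toy vector are hypothetical.

```python
def top_k_sparsify(values, fraction):
    """Keep the largest-magnitude `fraction` of entries and zero the rest."""
    k = max(1, int(len(values) * fraction))
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

# Toy delta with 10 entries; keeping 20% retains the two largest magnitudes.
delta = [0.05, -1.2, 0.3, 0.0, 2.4, -0.01, 0.7, -0.4, 0.02, 0.1]
sparse = top_k_sparsify(delta, 0.2)
print(sparse)  # only -1.2 and 2.4 survive
```

In practice the sender would transmit only the surviving values and their indices; the entries zeroed here are what an error-feedback accumulator would carry forward to the next communication step.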