Communication Efficient LLM Pre-training with SparseLoCo

Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky

TL;DR

Cross-datacenter training of large language models incurs heavy communication costs. SparseLoCo introduces per-replica Local Outer Momentum and combines Top-$k$ sparsification with error feedback and 2-bit quantization to aggressively compress updates without sacrificing performance, outperforming full-precision DiLoCo and other baselines. The method replaces global outer momentum with local accumulators, linking local momentum to error feedback and enabling sparse aggregation to actually improve results. Empirical evaluation across multiple model scales and replication settings demonstrates significant communication reductions while preserving or improving convergence, placing SparseLoCo on the Pareto frontier of loss versus communication. The work also reports real-world deployment scenarios for internet-scale collaborative training, highlighting practical impact and future potential of sparse aggregation in LLM pre-training.

Abstract

Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across datacenters and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization is often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages error feedback with Top-$k$ sparsification and 2-bit quantization to reach extreme sparsity as low as 1-3% while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback accumulator combined with aggressive sparsity, and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
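
To make the compression pipeline named in the abstract concrete, the following is a minimal NumPy sketch of Top-$k$ sparsification with an error feedback accumulator and a 2-bit quantizer. It is an illustration under stated assumptions, not the authors' implementation: the function names, the uniform four-level quantizer, and the decay parameter `beta` on the accumulator are all hypothetical choices made for this example (with `beta=1.0` the accumulator reduces to classical error feedback).

```python
import numpy as np

def topk_mask(x, k):
    """Boolean mask selecting the k largest-magnitude entries of a 1-D array."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask[idx] = True
    return mask

def quantize_2bit(values):
    """Hypothetical 2-bit quantizer: 4 uniformly spaced levels over [min, max]."""
    lo, hi = values.min(), values.max()
    if hi == lo:  # all values identical: a single level is enough
        return np.zeros(values.shape, dtype=np.uint8), lo, 0.0
    scale = (hi - lo) / 3.0  # 4 levels -> 3 intervals
    codes = np.rint((values - lo) / scale).astype(np.uint8)  # codes in {0,1,2,3}
    return codes, lo, scale

def dequantize_2bit(codes, lo, scale):
    """Map 2-bit codes back to floating-point values."""
    return lo + codes.astype(np.float64) * scale

def compress_with_error_feedback(delta, error, k, beta=1.0):
    """One compression step (assumed form, for illustration only).

    delta: this replica's pseudo-gradient (1-D array)
    error: accumulator holding what earlier rounds did not transmit
    beta:  hypothetical decay applied to the accumulator
    """
    acc = beta * error + delta             # fold the new update into the accumulator
    idx = np.flatnonzero(topk_mask(acc, k))
    codes, lo, scale = quantize_2bit(acc[idx])
    sent = np.zeros_like(acc)
    sent[idx] = dequantize_2bit(codes, lo, scale)  # what the receiver reconstructs
    new_error = acc - sent                 # untransmitted residual is kept locally
    return idx, codes, (lo, scale), sent, new_error

# Usage: keep 3% of a 100-dimensional update.
rng = np.random.default_rng(0)
delta = rng.normal(size=100)
idx, codes, params, sent, new_error = compress_with_error_feedback(
    delta, np.zeros(100), k=3
)
```

Only `idx`, `codes`, and the two quantizer parameters would need to be transmitted in this sketch. The residual `new_error` stays on the replica, so both the entries dropped by Top-$k$ and the rounding error of the quantizer are carried into later rounds rather than lost.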

Paper Structure

This paper contains 25 sections, 1 theorem, 8 equations, 2 figures, 15 tables, 1 algorithm.

Key Result

Proposition 1

Suppose identical initialization $m_r^{(0)}=m^{(0)}=0$ for all $r\in[R]$, and fixed outer-momentum coefficient $\beta\in[0,1)$. Then, for all $t\ge 0$, where $\bar{m}^{(t)} := \tfrac{1}{R}\sum_{r=1}^R m_r^{(t)}$ denotes the average of local momentum buffers, $\bar{\tilde{\Delta}}^{(t)} := \tfrac{1}{R}\sum_{r=1}^R \tilde{\Delta}_r^{(t)}$ the averaged LOM Nesterov direction, and $m^{(t)}$ and $\til
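
The statement above is cut off in this copy, so its conclusion is not reproduced here. The sketch below checks only a basic fact that the listed definitions rely on: a momentum recursion is linear, so averaging $R$ local buffers gives the same result as running a single buffer on the averaged update, provided all buffers start at zero and share the same $\beta$. The recursion $m \leftarrow \beta m + \Delta$ and all variable names are assumptions made for illustration; the paper's exact update rule may differ.

```python
import numpy as np

def momentum_step(m, delta, beta):
    """Assumed momentum recursion: m <- beta * m + delta."""
    return beta * m + delta

rng = np.random.default_rng(0)
R, dim, steps, beta = 4, 5, 10, 0.9

local_m = np.zeros((R, dim))   # one buffer per replica, identical zero init
global_m = np.zeros(dim)       # single buffer fed with the averaged update

for _ in range(steps):
    deltas = rng.normal(size=(R, dim))                 # per-replica updates
    local_m = momentum_step(local_m, deltas, beta)     # local recursions
    global_m = momentum_step(global_m, deltas.mean(axis=0), beta)

# Largest absolute difference between the averaged local buffers
# and the single global buffer.
gap = np.abs(local_m.mean(axis=0) - global_m).max()
```

Any quantity that is a linear function of the buffer and the update, such as a Nesterov-style look-ahead direction, inherits the same property. The equality would generally break once a non-linear operation such as Top-$k$ selection is applied before averaging.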

Figures (2)

  • Figure 1: SparseLoCo outperforms DiLoCo for $H\in\{15,30,50,100\}$ communication intervals. We evaluate SparseLoCo, DiLoCo, and DiLoCo without Nesterov for different communication intervals and at different sparsity levels for SparseLoCo. We report the best performance in each case. Crucially, SparseLoCo can outperform DiLoCo while communicating significantly less. We also observe that the optimal density grows with higher communication intervals. All experiments were conducted with $R=8$ workers and 512M model size.
  • Figure 2: SparseLoCo lies on the Pareto frontier between loss and communication volume. We report communication volume (outbound) for two settings (A) ring communication topology (ring all-gather for SparseLoCo and DeMo, ring all-reduce for DiLoCo) (B) Parameter server. The points consider different $H$ for DiLoCo, different densities for DeMo, and combinations of both for SparseLoCo using 512M models. We observe that, in both cases, SparseLoCo is at the Pareto frontier.
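
As a rough guide to how the topologies in Figure 2 affect outbound volume, the following sketch applies the standard per-worker cost formulas for ring all-reduce and ring all-gather. It is a back-of-the-envelope estimate, not the paper's accounting: the function names are hypothetical, and the 32-bit dense values and 32-bit indices are assumed defaults that the paper may not use.

```python
def dense_payload_bytes(num_params, bits_per_value=32):
    """Size of one dense pseudo-gradient (assumed 32-bit values)."""
    return num_params * bits_per_value / 8

def sparse_payload_bytes(num_params, density, value_bits=2, index_bits=32):
    """Size of one sparse update: kept values plus their indices.

    index_bits is an assumed encoding cost, not a figure from the paper.
    """
    kept = round(num_params * density)
    return kept * (value_bits + index_bits) / 8

def ring_all_reduce_bytes(payload_bytes, workers):
    """Per-worker outbound bytes for a ring all-reduce of a shared-size payload."""
    return 2 * (workers - 1) / workers * payload_bytes

def ring_all_gather_bytes(local_payload_bytes, workers):
    """Per-worker outbound bytes when each worker contributes its own payload."""
    return (workers - 1) * local_payload_bytes

# Usage with the setting from the figures: 512M parameters, R = 8 workers.
num_params, workers = 512_000_000, 8
dense_ring = ring_all_reduce_bytes(dense_payload_bytes(num_params), workers)
sparse_ring = ring_all_gather_bytes(
    sparse_payload_bytes(num_params, density=0.03), workers
)
# With a parameter server, each worker sends its own payload once per round.
sparse_ps = sparse_payload_bytes(num_params, density=0.03)
```

The all-gather cost grows with the number of workers because sparse updates with different index sets cannot be summed in place along the ring, which is why the index encoding cost matters in this setting.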

Theorems & Definitions (2)

  • Proposition 1
  • proof