
An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning

Chuyan Chen, Chenyang Ma, Zhangxin Li, Yutong He, Yanjie Dong, Kun Yuan

TL;DR

ARC-Top-$K$ is an All-Reduce-compatible gradient compressor that aligns sparsity patterns across nodes via a lightweight gradient sketch while preserving globally informative entries. It guarantees global contraction with $\alpha=K/m$ and, when combined with EF21M, delivers faster convergence and significant wall-clock savings compared to Top-$K$ and Rand-$K$. Empirical evaluations across CIFAR, GLUE, and C4 show that ARC-Top-$K$ matches dense baselines and often outperforms Rand-$K$ at the same compression ratio, and that it scales to 64 nodes. Because the shared sparsity pattern makes All-Reduce index-free, the method reduces communication without sacrificing convergence or accuracy.

Abstract

Communication remains a central bottleneck in large-scale distributed machine learning, and gradient sparsification has emerged as a promising strategy to alleviate this challenge. However, existing gradient compressors face notable limitations: Rand-$K$ discards structural information and performs poorly in practice, while Top-$K$ preserves informative entries but loses the contraction property and requires costly All-Gather operations. In this paper, we propose ARC-Top-$K$, an All-Reduce-Compatible Top-$K$ compressor that aligns sparsity patterns across nodes using a lightweight sketch of the gradient, enabling index-free All-Reduce while preserving globally significant information. ARC-Top-$K$ is provably contractive and, when combined with momentum error feedback (EF21M), achieves linear speedup and sharper convergence rates than the original EF21M under standard assumptions. Empirically, ARC-Top-$K$ matches the accuracy of Top-$K$ while reducing wall-clock training time by up to 60.7%, offering an efficient and scalable solution that combines the robustness of Rand-$K$ with the strong performance of Top-$K$.
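
The idea of aligning sparsity patterns through a sketch can be illustrated with a small single-process simulation. This is only a rough sketch under assumptions that the abstract does not state: the gradient is viewed as an $m \times n$ matrix, the sketch is a shared Gaussian random projection of each row, and rows are ranked by the norm of the averaged sketch. The function name `arc_topk_round` and the parameters `sketch_dim` and `seed` are hypothetical, and `np.mean` over the node list stands in for an All-Reduce.

```python
import numpy as np

def arc_topk_round(grads, K, sketch_dim, seed=0):
    """Simulate one sketch-aligned Top-K round (illustrative, not the paper's exact algorithm).

    grads: list of per-node gradient matrices, each of shape (m, n).
    Returns the sparse aggregated gradient and the shared row indices.
    """
    # Assumption: all nodes share the seed, so they build the same projection.
    rng = np.random.default_rng(seed)
    m, n = grads[0].shape
    proj = rng.standard_normal((n, sketch_dim)) / np.sqrt(sketch_dim)

    # Step 1: each node sketches its rows; the small sketches are averaged
    # (stand-in for an All-Reduce on an m x sketch_dim tensor).
    sketch = np.mean([g @ proj for g in grads], axis=0)

    # Step 2: every node ranks rows using the same averaged sketch, so all
    # nodes arrive at identical indices without exchanging them.
    scores = np.linalg.norm(sketch, axis=1)
    rows = np.sort(np.argpartition(scores, -K)[-K:])

    # Step 3: only the selected rows are averaged (second All-Reduce stand-in).
    out = np.zeros((m, n))
    out[rows] = np.mean([g[rows] for g in grads], axis=0)
    return out, rows

# Toy usage: 4 simulated nodes, 8 x 16 gradients, keep K = 2 rows.
data_rng = np.random.default_rng(1)
grads = [data_rng.standard_normal((8, 16)) for _ in range(4)]
dense = np.mean(grads, axis=0)
approx, rows = arc_topk_round(grads, K=2, sketch_dim=4)
rel_err = np.linalg.norm(approx - dense) ** 2 / np.linalg.norm(dense) ** 2
print("shared rows:", rows, "relative error:", rel_err)
```

Because every node selects the same rows, the selected values can be summed directly, which is what makes an index-free All-Reduce possible. Per-node Top-$K$, by contrast, generally yields different indices on each node and therefore needs an All-Gather.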

Paper Structure

This paper contains 13 sections, 6 theorems, 27 equations, 2 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Let $N=2$, $d=2$, and $K=1$. Consider $\bm g^{(1)}=[-1,\,0.1]^{\top}$, $\bm g^{(2)}=[1,\,0.1]^{\top}$, and $\bm g = \tfrac{1}{2}(\bm g^{(1)} + \bm g^{(2)})$. Let $\mathcal{C}(\cdot)$ be the Top-$K$ compressor defined above. It holds that $\bigl\|\tfrac{1}{2}\bigl(\mathcal{C}(\bm g^{(1)}) + \mathcal{C}(\bm g^{(2)})\bigr) - \bm g\bigr\|^2 = \|\bm g\|^2$, which implies that the aggregated compressor ${\mathcal{C}}({\bm g})$ is non-contractive.
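
The counterexample in Proposition 1 can be checked numerically. The snippet below is a minimal sketch, assuming that the global compressor is the average of the per-node Top-$K$ outputs and that contractivity means $\|\mathcal{C}(\bm g)-\bm g\|^2 \le (1-\alpha)\|\bm g\|^2$ for some $\alpha\in(0,1]$. The helper `top_k` is a hypothetical name introduced for illustration.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

g1 = np.array([-1.0, 0.1])
g2 = np.array([1.0, 0.1])
g = 0.5 * (g1 + g2)  # true average gradient

# Each node keeps its first coordinate; the two kept values cancel on
# averaging, while the shared second coordinate is dropped by both nodes.
agg = 0.5 * (top_k(g1, 1) + top_k(g2, 1))

err = np.sum((agg - g) ** 2)  # squared compression error
norm = np.sum(g ** 2)         # squared norm of the true average
print("error:", err, "norm:", norm)
```

The error equals the full squared norm of $\bm g$, so no $\alpha > 0$ can satisfy the contraction inequality for this input.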

Figures (2)

  • Figure 1: Workflow of the ARC-Top-$K$ algorithm, detailing the process of gradient compression and aggregation across two nodes in a distributed system.
  • Figure 2: Loss curves of pre-training LLaMA-130M on C4.

Theorems & Definitions (10)

  • Definition 1: Contractive Compressor
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1: ARC-Top-$\bm K$ Convergence with EF21M
  • proof: Proof of Theorem 1