Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees

Chuyan Chen, Yutong He, Pengrui Li, Weichen Jia, Kun Yuan

TL;DR

GreedyLore addresses the communication bottleneck in distributed stochastic optimization by introducing a greedy low-rank gradient compressor with error feedback and a semi-lazy SVD update. It also employs an approximate global Top-$r$ projection to better capture the structure of the global gradient. The authors prove convergence guarantees, achieving the rate $\mathcal{O}\left(\frac{\sigma}{\sqrt{NT}} + \frac{1}{T}\right)$ under MSGD and Adam, which gives a linear speedup in iteration complexity with respect to the number of nodes $N$. Empirical results across ResNet pre-training on CIFAR, LLaMA pre-training, and RoBERTa fine-tuning show the method outperforming prior low-rank and quantization-based compressors, with practical memory overhead and seamless integration into standard distributed training frameworks.

Abstract

Distributed optimization is pivotal for large-scale signal processing and machine learning, yet communication overhead remains a major bottleneck. Low-rank gradient compression, in which the transmitted gradients are approximated by low-rank matrices to reduce communication, offers a promising remedy. Existing methods typically adopt either randomized or greedy compression strategies: randomized approaches project gradients onto randomly chosen subspaces, introducing high variance and degrading empirical performance; greedy methods select the most informative subspaces, achieving strong empirical results but lacking convergence guarantees. To address this gap, we propose GreedyLore--the first Greedy Low-Rank gradient compression algorithm for distributed learning with rigorous convergence guarantees. GreedyLore incorporates error feedback to correct the bias introduced by greedy compression and introduces a semi-lazy subspace update that ensures the compression operator remains contractive throughout all iterations. With these techniques, we prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam--marking the first linear speedup convergence rate for low-rank gradient compression. Extensive experiments are conducted to validate our theoretical findings.
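To make the two ingredients named in the abstract concrete, the following is a minimal single-node sketch of greedy low-rank compression combined with error feedback. It is not the authors' algorithm: the function names, the refresh period `tau`, and the rule of recomputing the projector from an exact SVD every `tau` steps are assumptions made for illustration, standing in for the semi-lazy subspace update, whose details the abstract does not give.

```python
import numpy as np


def top_r_projector(G, r):
    """Greedy choice: orthonormal basis P (m x r) of the top-r left singular subspace of G."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]


def compress_with_error_feedback(G, E, P):
    """Compress the error-corrected gradient G + E with projector P.

    Returns the r x n coefficient matrix (the part that would be communicated)
    and the new residual E that is carried to the next step.
    """
    corrected = G + E
    coeffs = P.T @ corrected
    new_E = corrected - P @ coeffs
    return coeffs, new_E


def run_demo(steps=6, tau=3, m=8, n=6, r=2, seed=0):
    """Hypothetical driver: random matrices stand in for stochastic gradients."""
    rng = np.random.default_rng(seed)
    E = np.zeros((m, n))
    P = None
    history = []
    for t in range(steps):
        G = rng.standard_normal((m, n))
        # Assumption: refresh the subspace only every tau steps, reuse it otherwise.
        if t % tau == 0:
            P = top_r_projector(G + E, r)
        coeffs, E = compress_with_error_feedback(G, E, P)
        history.append((coeffs.shape, float(np.linalg.norm(E))))
    return P, E, history
```

In this sketch each step sends an $r \times n$ matrix instead of the full $m \times n$ gradient, and whatever the projection discards is kept in `E` and added back before the next compression, so the discarded component is delayed rather than lost.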

Paper Structure

This paper contains 29 sections, 11 theorems, 78 equations, 5 figures, 7 tables, 5 algorithms.

Key Result

Proposition 1

Consider compressor $\mathcal{C}_t(\bm{G}_t) = \bm{P} \bm{P}^\top \bm{G}_t$ for $t = 1, \ldots, \tau - 1$, where $\bm{G}_t \in \mathbb{R}^{m \times n}$ is an arbitrary matrix and $\bm{P} \in \mathbb{R}^{m \times r}$ is a fixed projection matrix satisfying $\bm{P}^\top \bm{P} = \bm{I}_r$. With error

Figures (5)

  • Figure 1: Loss curves for pre-training LLaMA-130M using random projection and greedy projection as low-rank compressors with rank 32. Compression is applied starting from step 1,000. It can be observed that employing random projection in the early training phase introduces excessive noise, which leads to slower convergence.
  • Figure 2: Loss curves for pre-training LLaMA-350M using the original GaLore, GaLore with full optimizer states, and GreedyLore with and without error feedback, under a compression rank $r=32$.
  • Figure 3: Test accuracy of ResNet-18 pre-trained on the CIFAR-10 and CIFAR-100 datasets for 40 epochs.
  • Figure 4: Training loss for pre-training the LLaMA-1B model on the C4 dataset.
  • Figure 5: Peak memory usage when pre-training LLaMA models.

Theorems & Definitions (26)

  • Definition 1: Contractive Compressor
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Theorem 1: GreedyLore Convergence with MSGD
  • ...and 16 more