Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees
Chuyan Chen, Yutong He, Pengrui Li, Weichen Jia, Kun Yuan
TL;DR
GreedyLore addresses the communication bottleneck in distributed stochastic optimization by introducing a greedy low-rank gradient compressor with error feedback and a semi-lazy SVD update. It also employs an approximate global Top-$r$ projection to better capture the structure of the global gradient. The authors prove convergence guarantees, achieving the rate $\mathcal{O}\left(\frac{\sigma}{\sqrt{NT}} + \frac{1}{T}\right)$ under MSGD and Adam, and demonstrate a linear speedup in iteration complexity with the number of nodes $N$. Empirical results on ResNet pre-training on CIFAR, LLaMA pre-training, and RoBERTa fine-tuning validate the method's superiority over prior low-rank and quantization-based compressors, with practical memory overhead and seamless integration into standard distributed training frameworks.
Abstract
Distributed optimization is pivotal for large-scale signal processing and machine learning, yet communication overhead remains a major bottleneck. Low-rank gradient compression, in which the transmitted gradients are approximated by low-rank matrices to reduce communication, offers a promising remedy. Existing methods typically adopt either randomized or greedy compression strategies: randomized approaches project gradients onto randomly chosen subspaces, introducing high variance and degrading empirical performance; greedy methods select the most informative subspaces, achieving strong empirical results but lacking convergence guarantees. To address this gap, we propose GreedyLore--the first Greedy Low-Rank gradient compression algorithm for distributed learning with rigorous convergence guarantees. GreedyLore incorporates error feedback to correct the bias introduced by greedy compression and introduces a semi-lazy subspace update that ensures the compression operator remains contractive throughout all iterations. With these techniques, we prove that GreedyLore achieves a convergence rate of $\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD and Adam--marking the first linear speedup convergence rate for low-rank gradient compression. Extensive experiments are conducted to validate our theoretical findings.
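To make the interplay between greedy low-rank compression and error feedback concrete, the following is a minimal single-node sketch in NumPy. It is not the GreedyLore algorithm itself: it uses a full truncated SVD at every step as the greedy compressor and does not model the semi-lazy subspace update, the approximate global Top-$r$ projection, or multi-node communication. The function names `top_r_project` and `ef_compress_step` are hypothetical and chosen only for illustration.

```python
import numpy as np

def top_r_project(matrix, rank):
    """Greedy rank-r approximation: keep the top-r singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

def ef_compress_step(gradient, error, rank):
    """One generic error-feedback step (illustrative, not the paper's exact update).

    The residual left over from earlier compressions is added back to the
    gradient before compressing, so discarded information is not lost for good.
    """
    corrected = gradient + error
    compressed = top_r_project(corrected, rank)  # what would be transmitted
    new_error = corrected - compressed           # carried to the next step
    return compressed, new_error

# Toy usage: an 8x6 "gradient" compressed to rank 2.
rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 6))
error = np.zeros_like(grad)
compressed, error = ef_compress_step(grad, error, rank=2)
```

In this sketch the truncated SVD is contractive: the squared Frobenius norm of the residual is at most $(1 - r/\min(m, n))$ times that of the input, because the top $r$ squared singular values are at least as large as the average. A contraction property of this kind is what the abstract says the semi-lazy subspace update preserves across iterations.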
