From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency
Sizhe Dang, Jiaqi Shao, Xiaodong Zheng, Guang Dai, Yan Song, Haishan Ye
TL;DR
The paper introduces TSR-Adam, a two-sided low-rank communication scheme that synchronizes a compact core $C=U^\top G V$ for each matrix-shaped parameter block, reducing per-step communication from $O(mn)$ to $O(r^2)$ while keeping the Adam moment states in the core space. It addresses peak communication during subspace refresh with a randomized SVD-based refresh that communicates only low-dimensional sketches, and extends the approach to embedding gradients with embedding-specific ranks and refresh schedules. Empirically, TSR-Adam reduces average communicated bytes per step by about 13x in pretraining (60M to 1B parameters) and about 25x in GLUE fine-tuning, with performance comparable to the baselines, and a convergence analysis provides a stationarity guarantee for the proposed update. These results indicate that two-sided low-rank synchronization can improve communication efficiency in large-scale distributed training of language models while retaining the memory benefits of low-rank optimizers.
Abstract
As foundation models continue to scale, pretraining increasingly relies on data-parallel distributed optimization, making bandwidth-limited gradient synchronization a key bottleneck. Orthogonally, projection-based low-rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication-limited training: one-sided synchronization still transmits an $O(rn)$ object for an $m\times n$ matrix gradient and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam) by synchronizing a compact core $U^\top G V\in\mathbb{R}^{r\times r}$, reducing the dominant per-step payload from $O(mn)$ to $O(r^2)$ while keeping moment states in low-dimensional cores. To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. We additionally extend low-rank communication to embedding gradients with embedding-specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR-Adam.
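The two-sided synchronization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`two_sided_core`, `core_adam_step`), the random orthonormal bases, the toy sizes, and the way the core-space Adam step is lifted back through $U$ and $V$ are all assumptions made for illustration, and the all-reduce is simulated by a plain mean over in-process "workers". The point is only that the projection $G \mapsto U^\top G V$ is linear, so averaging the $r\times r$ cores gives the same result as projecting the averaged gradient.

```python
import numpy as np

def two_sided_core(G, U, V):
    """Project an m x n gradient onto the r x r core C = U^T G V."""
    return U.T @ G @ V

def core_adam_step(C, m_state, v_state, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam moment update applied to the r x r core (assumed form)."""
    m_state = beta1 * m_state + (1.0 - beta1) * C
    v_state = beta2 * v_state + (1.0 - beta2) * C**2
    m_hat = m_state / (1.0 - beta1**t)
    v_hat = v_state / (1.0 - beta2**t)
    step = m_hat / (np.sqrt(v_hat) + eps)
    return step, m_state, v_state

rng = np.random.default_rng(0)
m, n, r, num_workers = 64, 48, 4, 3  # hypothetical toy sizes

# Shared orthonormal bases; here random, whereas the paper refreshes them
# with a randomized SVD-based procedure that is not reproduced here.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))  # m x r
V, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r

# Each worker holds its own local gradient.
grads = [rng.standard_normal((m, n)) for _ in range(num_workers)]

# Each worker sends only its r x r core; the mean stands in for all-reduce.
cores = [two_sided_core(G, U, V) for G in grads]
C_avg = np.mean(cores, axis=0)

# Linearity check: mean of cores equals core of the mean gradient.
G_avg = np.mean(grads, axis=0)
assert np.allclose(C_avg, two_sided_core(G_avg, U, V))

# Moment states live in the r x r core space, not the m x n parameter space.
m_state = np.zeros((r, r))
v_state = np.zeros((r, r))
step, m_state, v_state = core_adam_step(C_avg, m_state, v_state, t=1)

# Lift the core-space step back to parameter shape (assumed update form).
delta = U @ step @ V.T

# Per-step payload in number of scalars for this block.
dense_payload = m * n      # full gradient
one_sided_payload = r * n  # one-sided projection U^T G
core_payload = r * r       # two-sided core U^T G V
print(dense_payload, one_sided_payload, core_payload)
```

With these toy sizes the three payloads are 3072, 192, and 16 scalars per step, which mirrors the $O(mn)$, $O(rn)$, and $O(r^2)$ scaling stated in the abstract; the Adam states shrink to two $r\times r$ arrays for the same reason.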
