From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency

Sizhe Dang, Jiaqi Shao, Xiaodong Zheng, Guang Dai, Yan Song, Haishan Ye

TL;DR

The paper introduces TSR-Adam, a two-sided low-rank communication scheme that synchronizes a compact core $C=U^\top G V$ for each matrix-shaped parameter block, reducing per-step communication from $O(mn)$ to $O(r^2)$ while preserving Adam momentum in the core space. It addresses peak communication during subspace refresh with a randomized SVD-based refresh that communicates only low-dimensional sketches, and extends the approach to embedding gradients with embedding-specific ranks. Empirically, TSR-Adam achieves substantial bytes-per-step reductions (up to ~13x pretraining, ~25x GLUE fine-tuning) with comparable performance to dense AdamW and prior low-rank methods, and a convergence analysis provides a stationarity guarantee under standard smoothness and projection assumptions. These results demonstrate practical, scalable improvements in communication efficiency for large-scale distributed training of language models, with memory benefits and embedding-aware design enhancing applicability.

Abstract

As foundation models continue to scale, pretraining increasingly relies on data-parallel distributed optimization, making bandwidth-limited gradient synchronization a key bottleneck. Orthogonally, projection-based low-rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication-limited training: one-sided synchronization still transmits an $O(rn)$ object for an $m\times n$ matrix gradient and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam) by synchronizing a compact core $U^\top G V\in\mathbb{R}^{r\times r}$, reducing the dominant per-step payload from $O(mn)$ to $O(r^2)$ while keeping moment states in low-dimensional cores. To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids full-gradient synchronization. We additionally extend low-rank communication to embedding gradients with embedding-specific ranks and refresh schedules, yielding additional communication and memory savings over keeping embeddings dense. Across pretraining from 60M to 1B model scales, TSR-Adam reduces average communicated bytes per step by $13\times$, and on GLUE fine-tuning it reduces communication by $25\times$, while achieving comparable performance; we further provide a theoretical stationarity analysis for the proposed update. Code is available at https://github.com/DKmiyan/TSR-Adam.
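
The core synchronization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`orthonormal_basis`, `project_core`, `lift_core`), the dimensions, and the use of random orthonormal bases in place of refreshed subspaces are all assumptions made for illustration, and the Adam moment updates and the randomized SVD-based refresh are omitted. Averaging over a Python list stands in for an all-reduce across workers.

```python
import numpy as np

def orthonormal_basis(rows, rank, rng):
    # Hypothetical stand-in for a subspace basis: QR of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((rows, rank)))
    return q  # shape (rows, rank), orthonormal columns

def project_core(U, G, V):
    # Two-sided projection: an m x n gradient becomes an r x r core.
    return U.T @ G @ V

def lift_core(U, C, V):
    # Map an r x r core back to the m x n parameter space.
    return U @ C @ V.T

rng = np.random.default_rng(0)
m, n, r, workers = 64, 48, 4, 3  # illustrative sizes, not from the paper

U = orthonormal_basis(m, r, rng)
V = orthonormal_basis(n, r, rng)

# Each simulated worker holds its own local gradient.
local_grads = [rng.standard_normal((m, n)) for _ in range(workers)]

# Workers project locally; only the r x r cores would be communicated.
avg_core = np.mean([project_core(U, G, V) for G in local_grads], axis=0)

# The projection is linear, so the averaged core equals the core of the
# averaged dense gradient, which is what a dense all-reduce would give.
core_of_avg = project_core(U, np.mean(local_grads, axis=0), V)

# Full-size direction recovered from the synchronized core.
update_direction = lift_core(U, avg_core, V)
```

Because the projection is linear in $G$, averaging the cores gives the same result as projecting the averaged gradient, so the sketch exchanges $r^2$ numbers per matrix instead of $mn$.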

Paper Structure (68 sections, 3 theorems, 113 equations, 6 figures, 7 tables, 2 algorithms)

This paper contains 68 sections, 3 theorems, 113 equations, 6 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Suppose the above assumptions hold. Set $\eta = \frac{1}{L\,T^{2/3}}$ and choose $\beta^2 = 1-\sqrt{40L\eta}=1-\frac{\sqrt{40}}{T^{1/3}}$ (so $1-\beta^2=\frac{\sqrt{40}}{T^{1/3}}$). Define $\Delta f:=\mathbb{E}[f(w_0)-f^\star]$, let $\mathbb{I}_{\mathrm{refresh}}(t)\in\{0,1\}$ indicate whether step

Figures (6)

  • Figure 1: Bytes-to-Loss. Training loss as a function of cumulative communicated bytes for representative model scales. TSR-Adam reaches lower loss under the same communication budget compared with baselines. (a)--(c) correspond to three representative model scales.
  • Figure 2: Comparison of Communication Mechanisms. Visualizing synchronized objects (block volume) and bandwidth usage (arrow width). (a) Adam transmits dense gradients ($O(mn)$). (b) GaLore compresses linear layers ($O(rn)$) but leaves embeddings dense (large blue cube) and uses heavy SVD refresh. (c) TSR-Adam synchronizes tiny $r \times r$ cores ($O(r^2)$) across all layers and uses lightweight sketches (green bars) for refresh, achieving the lowest communication footprint.
  • Figure 3: Ablations. We isolate the effects of (a) one-sided vs two-sided compression, (b) randomized SVD-style refresh, and (c) subspace refresh interval $K$ on the loss--communication trade-off.
  • Figure 4: Loss--communication Pareto frontiers across model scales. Final pretraining loss versus communicated Bytes/Step for 60M/130M/350M/1B models. TSR shifts the frontier toward lower communication for competitive loss, relative to AdamW and GaLore.
  • Figure 5: Embedding matters. (a) Breakdown of bytes per step for the embedding and linear layers across different model sizes. (b) Loss–Bytes curves comparing the use of low-rank compression on the embedding layer.
  • ...and 1 more figure
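
The per-step object sizes contrasted in the Figure 2 caption ($O(mn)$, $O(rn)$, $O(r^2)$) can be tallied for a single matrix with a short Python sketch. The function name `payload_elements` and the shape $m=n=4096$, $r=128$ are hypothetical choices for illustration and are not taken from the paper.

```python
def payload_elements(m, n, r):
    """Count synchronized elements per step for one m x n gradient matrix."""
    return {
        "dense": m * n,      # full gradient, O(mn)
        "one_sided": r * n,  # one-sided projected gradient, O(rn)
        "two_sided": r * r,  # two-sided core, O(r^2)
    }

# Hypothetical layer shape and rank, chosen only to make the arithmetic concrete.
sizes = payload_elements(m=4096, n=4096, r=128)
dense_over_core = sizes["dense"] / sizes["two_sided"]
one_sided_over_core = sizes["one_sided"] / sizes["two_sided"]
```

For these assumed values the dense payload is 16,777,216 elements, the one-sided payload is 524,288, and the core is 16,384. This is a per-matrix element count only and should not be read as the end-to-end reduction the paper reports.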

Theorems & Definitions (6)

  • Theorem 1: Stationarity bound
  • Remark 1: Interpretation of Convergence Terms
  • Lemma 1: Summation of a linear contraction recursion
  • proof
  • Lemma 2: Descent lemma for $w_{t+1}=w_t-\eta\widetilde{m}_t$
  • proof