LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

TL;DR

LoRDO addresses the memory and communication bottlenecks of distributed training by combining low-rank optimization with infrequent synchronization. It shows that naive global projections cause the optimization subspace to stagnate, and it introduces a full-rank quasi-hyperbolic momentum term that restores subspace exploration while preserving low-rank efficiency. Across language models from $125$M to $720$M parameters, LoRDO achieves near-parity with low-rank DDP while reducing communication by about $10\times$, with a perplexity gap under $1\%$ and competitive downstream-task performance. The method is most effective in very memory-constrained settings, enabling efficient pre-training on hardware with limited resources and offering scalable benefits for decentralized, bandwidth-limited environments.

Abstract

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
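
The "quasi-hyperbolic update" named in the abstract builds on quasi-hyperbolic momentum (QHM), which averages the raw gradient with a momentum buffer. The sketch below shows only the generic QHM rule on plain Python lists; it is not LoRDO's update, which additionally involves low-rank projections and infrequent synchronization that are not specified in this summary. The names `qhm_step`, `run_demo`, `lr`, `beta`, and `nu` are illustrative assumptions.

```python
def qhm_step(params, grads, momentum, lr=0.1, beta=0.9, nu=0.7):
    """One generic QHM step (illustrative; not the LoRDO update itself).

    momentum <- beta * momentum + (1 - beta) * grad
    param    <- param - lr * ((1 - nu) * grad + nu * momentum)
    """
    # Exponential moving average of the gradients.
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]
    # Mix the raw gradient with the updated momentum buffer.
    new_params = [
        p - lr * ((1.0 - nu) * g + nu * m)
        for p, g, m in zip(params, grads, new_momentum)
    ]
    return new_params, new_momentum


def run_demo(steps=200):
    """Minimize f(x) = 0.5 * x**2, whose gradient is x, starting from x = 5."""
    params, momentum = [5.0], [0.0]
    for _ in range(steps):
        grads = list(params)  # gradient of 0.5 * x**2 is x
        params, momentum = qhm_step(params, grads, momentum)
    return params[0]
```

With `nu=0` the step reduces to plain gradient descent, and with `nu=1` it reduces to momentum with an exponential moving average; intermediate values keep a direct contribution from the raw gradient in every step.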

Paper Structure (56 sections, 26 equations, 15 figures, 4 tables)

This paper contains 56 sections, 26 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Global projection matrix pathologies. Without quasi-hyperbolic momentum terms, LoRDO-Global fails to learn because the projection bases fail to update throughout local training.
  • Figure 2: LoRDO with global projections offers superior resilience to small-batch regimes compared to the local projection method. Particularly under heavy memory constraints, which necessitate low ranks, LoRDO surpasses DDP.
  • Figure 3: Stable rank (left column) and spectral gap (right column) for the attention layers of the $16$M parameter model, for LoRDO-Global (top row) and LoRDO-Local (bottom row).
  • Figure 4: Ablation across number of workers and local batch size ($M \times B$) for $16$M parameter experiments where the global batch size (or effective batch size) is $64$. We present this ablation for both LoRDO-Global and LoRDO-Local, together with the difference $\Delta = PPX_{\text{Local}} - PPX_{\text{Global}}$. As predicted, we find that LoRDO-Local is more sensitive to changes in the local batch size.
  • Figure 5: Ablation across synchronization frequency for LoRDO variants and QHM terms. Lower ranks are more sensitive to delays in synchronization. In addition to offering more stable performance, QHM terms reduce this sensitivity with an increased $\beta_1$.
  • ...and 10 more figures
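
Figure 3 reports stable rank. As a reading aid, the sketch below computes the standard definition, $\|A\|_F^2 / \|A\|_2^2$, using power iteration for the spectral norm. It assumes this standard definition is the one the paper uses; the function name `stable_rank`, the iteration count, and the all-ones starting vector are illustrative choices, and the spectral gap is not covered because its exact definition is not given here.

```python
import math


def stable_rank(matrix, iters=200):
    """Stable rank ||A||_F^2 / ||A||_2^2 of a list-of-lists matrix (illustrative)."""
    rows, cols = len(matrix), len(matrix[0])
    fro_sq = sum(x * x for row in matrix for x in row)
    if fro_sq == 0.0:
        return 0.0  # convention chosen here for the zero matrix

    # Power iteration on A^T A, starting from a normalized all-ones vector.
    # Caveat: a start vector orthogonal to the top singular vector would fail.
    v = [1.0 / math.sqrt(cols)] * cols
    sigma_sq = 0.0
    for _ in range(iters):
        av = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]  # A v
        sigma_sq = sum(x * x for x in av)  # Rayleigh quotient v^T A^T A v
        w = [sum(matrix[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    if sigma_sq == 0.0:
        raise ValueError("power iteration started in the null space of the matrix")
    return fro_sq / sigma_sq
```

Under this definition an identity matrix of size $n$ has stable rank $n$ and any rank-one matrix has stable rank $1$, so values near $1$ indicate that a single direction dominates the spectrum.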