LoRDO: Distributed Low-Rank Optimization with Infrequent Communication
Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane
TL;DR
LoRDO addresses the memory and communication bottlenecks of distributed training by combining low-rank optimization with infrequent synchronization. It shows that naive global projections confine optimization to a fixed low-rank subspace, and introduces a full-rank quasi-hyperbolic momentum to restore subspace exploration while preserving low-rank efficiency. Across language-model scales from $125$M to $720$M parameters, LoRDO achieves near-parity with low-rank DDP while reducing communication by about $10\times$, with a perplexity gap under $1\%$ and competitive downstream-task performance. The gains are largest in very memory-constrained settings (small rank or batch size), which makes the method suited to pre-training on limited hardware and in decentralized, bandwidth-limited environments.
Abstract
Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
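To make the subspace-restriction argument concrete, the following is a minimal sketch and not the authors' algorithm: the abstract does not state the update rule. It assumes momentum is stored in the coordinates of a fixed orthonormal basis `P` and that the applied update mixes the lifted momentum with the raw full-rank gradient using a weight `nu`, in the style of standard quasi-hyperbolic momentum. The names `project`, `lift`, `qh_low_rank_step`, `beta`, and `nu` are all hypothetical.

```python
def project(P, g):
    # P: list of r orthonormal basis vectors, each of length d.
    # Returns the r low-rank coordinates P^T g.
    return [sum(pi * gi for pi, gi in zip(p, g)) for p in P]


def lift(P, c):
    # Maps r low-rank coordinates back to the d-dimensional space: P c.
    d = len(P[0])
    return [sum(c[k] * P[k][i] for k in range(len(P))) for i in range(d)]


def qh_low_rank_step(P, m, g, beta=0.9, nu=0.7):
    """One illustrative step (assumed form, not taken from the paper).

    Momentum m lives in low-rank coordinates; the returned update is
    nu * (lifted momentum) + (1 - nu) * (full-rank gradient).
    """
    coords = project(P, g)
    m = [beta * mk + (1 - beta) * ck for mk, ck in zip(m, coords)]
    lifted = lift(P, m)
    update = [nu * li + (1 - nu) * gi for li, gi in zip(lifted, g)]
    return m, update


# Rank-1 subspace spanned by the first coordinate axis.
P = [[1.0, 0.0, 0.0]]
g = [1.0, 2.0, 0.0]  # the gradient has a component outside the subspace

# nu = 1.0: purely low-rank update, which never leaves span(P).
_, u_lowrank = qh_low_rank_step(P, [0.0], g, nu=1.0)

# nu < 1.0: the full-rank gradient term reintroduces the second coordinate.
_, u_qh = qh_low_rank_step(P, [0.0], g, nu=0.7)
```

In this toy setting, the second coordinate of `u_lowrank` is zero no matter what the gradient contains, whereas `u_qh` retains a fraction of it. This mirrors the qualitative claim that a full-rank term lets the trajectory leave the projected subspace.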
