Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training
Rongwei Lu, Jingyan Jiang, Chunyang Li, Xingguang Wei, Zhi Wang
TL;DR
The paper tackles the challenge of training large-scale models across WANs under high latency and limited bandwidth by developing a theoretical framework and an adaptive algorithm. It introduces Nested Virtual Sequences to decouple the effects of compression and staleness in DD-EF-SGD, and derives convergence rates for the non-convex and strongly convex cases. Building on these insights, it proposes DeCo-SGD, which jointly optimizes the gradient compression ratio and the delay to minimize end-to-end training time, implemented efficiently via a lookup table (LUT). Experiments on CIFAR-10, ImageNet, and Wikitext demonstrate substantial speedups over baselines and robust performance under varying bandwidth and latency, including non-IID data settings.
Abstract
Regional energy caps limit the growth of any single data center used for large-scale model training. This single-center training paradigm works when model size remains manageable, but exponential growth in model size and computational demand challenges it. A natural alternative is to distribute training across multiple data centers over wide-area networks. This pools distributed resources, but suffers from high latency and low, time-varying bandwidth, sharply reducing throughput. Jointly employing gradient compression and delayed aggregation can alleviate these communication problems, but introduces a complex three-way trade-off among compression ratio, staleness (the number of steps by which synchronization is delayed), and convergence rate. Existing work lacks theoretical guidance and can only propose fixed strategies that are insensitive to computation and communication conditions. We address this with a new theoretical tool that decomposes the joint optimization problem into a traditional process plus multiple analyzable noise terms. Our analysis yields the first convergence rate for this setting and shows that increasing staleness exponentially amplifies the detrimental effect of compression. Leveraging these insights, we propose DeCo-SGD, which dynamically selects the compression ratio and staleness based on real-time communication and computation conditions. DeCo-SGD achieves up to $5.07\times$ and $1.37\times$ speed-ups over distributed SGD and a static strategy, respectively, in high-latency networks with low, varying bandwidth.
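To make the interaction between compression and staleness concrete, the following is a minimal single-worker sketch of SGD with top-k gradient compression, error feedback, and delayed application of updates. It is an illustration under stated assumptions, not the paper's DD-EF-SGD or DeCo-SGD implementation: the function names (`top_k`, `delayed_ef_sgd`), the toy quadratic objective, and the fixed values of `k` and `tau` are all hypothetical, and the adaptive selection of compression ratio and staleness is not shown.

```python
from collections import deque


def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def delayed_ef_sgd(grad_fn, x0, lr, k, tau, steps):
    """Toy loop: top-k compression with error feedback, updates applied tau steps late.

    Hypothetical illustration only; a real system would aggregate the
    compressed updates from many workers over the network.
    """
    x = list(x0)
    error = [0.0] * len(x)   # residual dropped by the compressor so far
    in_flight = deque()      # compressed updates still "on the wire"
    for _ in range(steps):
        g = grad_fn(x)
        # Error feedback: add back what earlier compression steps discarded.
        corrected = [lr * gi + ei for gi, ei in zip(g, error)]
        sent = top_k(corrected, k)
        error = [c - s for c, s in zip(corrected, sent)]
        in_flight.append(sent)
        # Staleness: an update is applied only once tau newer ones are queued.
        if len(in_flight) > tau:
            update = in_flight.popleft()
            x = [xi - ui for xi, ui in zip(x, update)]
    return x


if __name__ == "__main__":
    # Assumed toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    grad = lambda x: list(x)
    x_final = delayed_ef_sgd(grad, [1.0, -2.0, 3.0, -4.0], lr=0.1, k=2, tau=2, steps=200)
    print(x_final)
```

With `k` equal to the vector length and `tau=0`, the loop reduces to plain gradient descent. Lowering `k` or raising `tau` makes each step cheaper to communicate but perturbs the iterates more, which is the trade-off the abstract describes.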
