Taming Latency and Bandwidth: A Theoretical Framework and Adaptive Algorithm for Communication-Constrained Training

Rongwei Lu, Jingyan Jiang, Chunyang Li, Xingguang Wei, Zhi Wang

TL;DR

The paper tackles the challenge of training large-scale models across WANs under high latency and limited bandwidth by developing a theoretical framework and an adaptive algorithm. It introduces Nested Virtual Sequences to decouple compression and staleness in DD-EF-SGD, and derives convergence rates for non-convex and strongly convex cases. Building on these insights, it proposes DeCo-SGD, which jointly optimizes gradient compression and delay to minimize end-to-end training time, implemented efficiently via a LUT-based lookup. Experiments on CIFAR-10, ImageNet, and Wikitext demonstrate substantial speedups over baselines and robust performance under varying bandwidth and latency, including non-IID data settings.

Abstract

Regional energy caps limit the growth of any single data center used for large-scale model training. This single-center training paradigm works when model size remains manageable, but exponential growth in model size and computational demand challenges it. A natural alternative is to distribute training across multiple data centers over wide-area networks. This pools distributed resources, but suffers from high latency and low, time-varying bandwidth, sharply reducing throughput. Jointly employing gradient compression and delayed aggregation can alleviate communication problems, but introduces a complex three-way trade-off among compression ratio, staleness (delayed synchronization steps), and convergence rate. Existing work lacks theoretical guidance and can only propose fixed strategies that are insensitive to computation and communication conditions. We address this with a new theoretical tool, decomposing the joint optimization problem into a traditional process plus multiple analyzable noise terms. Our analysis yields the first convergence rate for this setting and shows that increasing staleness exponentially amplifies the detrimental effect of compression. Leveraging these insights, we propose DeCo-SGD, which dynamically selects the compression ratio and staleness based on real-time communication and computation conditions. DeCo-SGD achieves up to $5.07\times$ and $1.37\times$ speed-ups over distributed SGD and static strategies, respectively, in networks with high latency and low, varying bandwidth.

Paper Structure

This paper contains 42 sections, 20 theorems, 95 equations, 30 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be $L$-smooth. There exists a stepsize $\gamma \leq \min \{ \frac{1}{4L\tau}, \frac{1}{4LZ\sqrt{\phi / \delta}}\}$, where $\phi=\frac{1-\delta}{\delta (1-\frac{\delta}{2})^{\tau}}$, such that at most iterations of DD-EF-SGD, it holds $\mathbb{E}\lVert \nabla f(\mathbf{x}_{out}) \rVert^2 \leq \epsilon$, and $\mathbf{x}_{out} = \mathbf{x}_t$ denotes an i
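The factor $\phi$ in Theorem 1 is what makes staleness amplify the cost of compression: each extra delay step multiplies $\phi$ by $1/(1-\delta/2)$. The following is a minimal sketch that evaluates $\phi$ and the stepsize bound exactly as written above. The function names are illustrative, and $Z$ is treated as a user-supplied constant because this excerpt does not define it.

```python
def phi(delta: float, tau: int) -> float:
    """phi = (1 - delta) / (delta * (1 - delta/2)**tau), as in Theorem 1."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return (1.0 - delta) / (delta * (1.0 - delta / 2.0) ** tau)


def stepsize_bound(L: float, Z: float, delta: float, tau: int) -> float:
    """min{1/(4 L tau), 1/(4 L Z sqrt(phi/delta))}.

    Z is an assumed input (not defined in this excerpt). Terms whose
    denominator is zero impose no constraint and are skipped.
    """
    bounds = []
    if tau > 0:
        bounds.append(1.0 / (4.0 * L * tau))
    p = phi(delta, tau)
    if p > 0.0 and Z > 0.0:
        bounds.append(1.0 / (4.0 * L * Z * (p / delta) ** 0.5))
    return min(bounds) if bounds else float("inf")


if __name__ == "__main__":
    # Geometric growth of phi in tau for a fixed compression ratio.
    for tau in (0, 2, 4, 8):
        print(tau, phi(0.1, tau))
```

With $\delta = 1$ the numerator vanishes and $\phi = 0$, so only the staleness term $\frac{1}{4L\tau}$ constrains the stepsize in this sketch.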

Figures (30)

  • Figure 1: Pioneering LLM training over WANs. DeepLink connects compute clusters in Shanghai and Ji'nan over $1{,}500$ km, with an estimated throughput efficiency of $35\%$. Intellect-1 trains across continents, where the corresponding throughput efficiency drops to approximately $15\%$.
  • Figure 2: The heatmap of throughput efficiency ($\%$) for D-SGD with four nodes training GPT-2, under different latency and bandwidth conditions. Each node is equipped with one A40 GPU. The throughput efficiency at (x, y) is defined as the throughput at (x, y) divided by the maximum achievable throughput of the machines.
  • Figure 3: The running timelines for D-SGD and D-SGD with communication optimization strategies. Both D-SGD and D-SGD with gradient compression are serial processes, with gradient compression reducing transmission time. D-SGD with delayed aggregation and DD-EF-SGD operate in parallel.
  • Figure 4: The high-level design of DeCo-SGD. Our design aims to minimize the end-to-end time while achieving the target accuracy. DeCo-SGD adaptively adjusts the compression ratio $\delta$ and staleness $\tau$ based on the dynamic network conditions monitored by the worker.
  • Figure 5: System implementation of DeCo-SGD. Each worker has (1) a local compute unit that computes and compresses gradients, and (2) a network monitor that measures bandwidth and latency. The server has a cache that stores stale updates, a compute unit that updates the model based on those stale updates, and a DeCo Unit that decides ($\delta, \tau$) based on the network conditions. Workers and the server communicate through the Comm. API.
  • ...and 25 more figures
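
The contrast drawn in the Figure 3 caption (serial versus parallel timelines) can be illustrated with a toy per-iteration timing model. This is a simplified sketch under stated assumptions, not the paper's estimate of $T_{\text{avg}}$ from Theorem 3: the function names are hypothetical, communication time is modeled as latency plus payload over bandwidth, compression scales the payload by $\delta$, and the parallel case assumes an idealized full overlap of compute and communication.

```python
def serial_iter_time(t_comp, latency, size_bits, bandwidth_bps, delta=1.0):
    """Serial timeline: compute, then communicate, so the two phases add up.

    delta is the fraction of the gradient actually transmitted
    (delta=1.0 means no compression).
    """
    return t_comp + latency + delta * size_bits / bandwidth_bps


def overlapped_iter_time(t_comp, latency, size_bits, bandwidth_bps, delta=1.0):
    """Idealized parallel timeline: communication fully overlaps compute,
    so the slower of the two phases sets the pace (an assumption of this
    sketch, ignoring how much staleness the overlap requires)."""
    t_comm = latency + delta * size_bits / bandwidth_bps
    return max(t_comp, t_comm)


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 s compute, 100 ms latency,
    # 800 Mbit gradient, 1 Gbit/s link.
    args = (0.5, 0.1, 8e8, 1e9)
    print("serial:             ", serial_iter_time(*args))
    print("serial + compress:  ", serial_iter_time(*args, delta=0.1))
    print("overlap:            ", overlapped_iter_time(*args))
    print("overlap + compress: ", overlapped_iter_time(*args, delta=0.1))
```

In this toy model, compression alone never removes the latency term from a serial timeline, while overlap alone stays communication-bound when the payload is large, which is one way to see why the two techniques are combined.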

Theorems & Definitions (26)

  • Theorem 1: Non-convex convergence rate of DD-EF-SGD
  • Remark 1: $\phi$ determining the convergence in the non-degradation condition
  • Remark 2: Degradation condition
  • Theorem 2: Convex convergence rate of DD-EF-SGD
  • Remark 3
  • Theorem 3: Estimation of $T_{\text{avg}}$ in DD-EF-SGD
  • Remark 4: The locally optimal compression ratio $\delta^*(\tau)$
  • Lemma 1
  • Lemma 2: The nature of Top-$k$
  • Lemma 3: Lemma 27 of the work stich2020communication
  • ...and 16 more
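
Lemma 2 in the list above concerns Top-$k$, and the "EF" in DD-EF-SGD suggests error feedback. This excerpt does not spell out either mechanism, so the following is a minimal sketch that assumes the standard Top-$k$ sparsifier (keep the $k$ largest-magnitude entries) and the standard error-feedback rule (add the untransmitted remainder to the next gradient). The helper names `top_k` and `ef_compress` are illustrative and do not come from the paper.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    out = [0.0] * len(vec)
    if k <= 0:
        return out
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    for i in keep:
        out[i] = vec[i]
    return out


def ef_compress(grad, error, k):
    """One error-feedback step.

    Compress (grad + error) with Top-k, transmit the sparse result, and
    carry whatever was dropped into the next step's error buffer.
    """
    corrected = [g + e for g, e in zip(grad, error)]
    sent = top_k(corrected, k)
    new_error = [c - s for c, s in zip(corrected, sent)]
    return sent, new_error


if __name__ == "__main__":
    error = [0.0] * 4
    for grad in ([1.0, -5.0, 3.0, 0.5], [1.0, 0.0, 0.0, 0.0]):
        sent, error = ef_compress(grad, error, k=2)
        print("sent:", sent, "carried error:", error)
```

A well-known property of this compressor is that the squared norm of the dropped part is at most $(1 - k/d)$ times the squared norm of the input, with $d$ the dimension, so $k/d$ plays the role of a compression ratio in this sketch.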