A Quadratic Synchronization Rule for Distributed Deep Learning

Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang

TL;DR

The paper tackles the high communication cost of data-parallel training by introducing the Quadratic Synchronization Rule (QSR), a dynamic schedule that lengthens the synchronization period as the learning rate decays: $H^{(s)} = \max \left\{ H_{\mathrm{base}}, \left\lfloor \left( \frac{\alpha}{\eta_t} \right)^2 \right\rfloor \right\}$, where $H^{(s)}$ is the synchronization period for round $s$ and $\eta_t$ is the learning rate at the start of that round. The rule is motivated by a quasistatic, SDE-informed view that balances optimization and generalization. Through Slow SDE analysis, the authors compare SGD, Local SGD with $H \sim \eta^{-1}$, and Local SGD with $H \sim \eta^{-2}$, arguing that QSR yields stronger implicit regularization by increasing the drift toward flatter minima; they prove an $O(\alpha^2)$ approximation error for the relevant moments. Empirically, QSR improves top-1 accuracy on ImageNet for ResNet-152 and ViT-B across cosine, linear, and step LR schedules, while reducing communication to as low as ~9–25% of the data-parallel baseline and shortening wall-clock training times on large-scale GPU clusters. The work demonstrates practical, theory-grounded gains in both generalization and efficiency, especially for large models trained with long horizons, and suggests QSR as a robust synchronization strategy for distributed deep learning in real-world, resource-constrained environments.
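
The rule above can be written as a small helper. This is a minimal sketch rather than the authors' implementation: the function name `qsr_period` and the optional `h_max` cap are assumptions introduced here for illustration (the cap only guards against very large periods when the learning rate approaches zero).

```python
import math


def qsr_period(lr, alpha, h_base, h_max=None):
    """Synchronization period under QSR: max(h_base, floor((alpha / lr)^2)).

    lr     -- learning rate at the start of the communication round (must be > 0)
    alpha  -- growth coefficient of the rule
    h_base -- minimum number of local steps per round
    h_max  -- hypothetical upper cap, not part of the stated rule
    """
    if lr <= 0:
        raise ValueError("lr must be positive")
    h = max(h_base, math.floor((alpha / lr) ** 2))
    if h_max is not None:
        # Never let the cap push the period below h_base.
        h = min(h, max(h_base, h_max))
    return h
```

While the learning rate is large, $(\alpha/\eta_t)^2$ stays below $H_{\mathrm{base}}$ and the period is constant; once the learning rate decays far enough, the period grows quadratically in $1/\eta_t$.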

Abstract

In distributed deep learning with data parallelism, synchronizing gradients at each training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with the standard data parallel training, QSR enables Local AdamW on ViT-B to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours and, at the same time, achieves $1.16\%$ or $0.84\%$ higher top-1 validation accuracy.
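
To make the mechanism in the abstract concrete, the following toy simulation runs Local SGD on a one-dimensional quadratic, with each round's length chosen in proportion to $1/\eta^2$. It is only an illustrative sketch: the objective, the noise model, the cosine schedule without warmup, the truncation of the final round, and every default value (`num_workers`, `total_steps`, `peak_lr`, `alpha`, `h_base`) are assumptions made here, not settings from the paper.

```python
import math
import random


def local_sgd_qsr(num_workers=4, total_steps=200, peak_lr=0.1,
                  alpha=0.05, h_base=2, seed=0):
    """Toy Local SGD on f(x) = 0.5 * x^2 with noisy gradients.

    Returns (final_x, num_syncs). All defaults are illustrative.
    """
    rng = random.Random(seed)

    def cosine_lr(t):
        # Cosine decay from peak_lr toward 0 over total_steps (no warmup).
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t / total_steps))

    x_global = 5.0
    step = 0
    num_syncs = 0
    while step < total_steps:
        # Choose the period from the learning rate at the start of the round.
        lr_start = cosine_lr(step)
        h = max(h_base, math.floor((alpha / lr_start) ** 2))
        # Truncate the last round so training ends exactly at total_steps.
        h = min(h, total_steps - step)

        local_models = []
        for _ in range(num_workers):
            x = x_global  # every worker starts the round from the shared model
            for i in range(h):
                grad = x + rng.gauss(0.0, 0.1)  # exact gradient plus noise
                x -= cosine_lr(step + i) * grad
            local_models.append(x)

        # Synchronize: average the local models (one communication round).
        x_global = sum(local_models) / num_workers
        num_syncs += 1
        step += h
    return x_global, num_syncs
```

Calling `local_sgd_qsr()` returns the final averaged parameter together with the number of communication rounds, which can be compared with `total_steps`, the number of synchronizations that per-step data-parallel training would need.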

Paper Structure (20 sections, 5 theorems, 51 equations, 9 figures, 6 tables, 2 algorithms)


Key Result

Theorem 3.1

Let $T > 0$ be a constant and $\bm{\zeta}(t)$ be the solution to one of the above Slow SDEs with the initial condition $\bm{\zeta}(0) = \Phi({\bm{\theta}}^{(0)}) \in \Gamma$. Let $g({\bm{\theta}})$ be any $\mathcal{C}^4$-smooth function. Here, $\mathcal{O}(\,\cdot\,)$ and $\tilde{\mathcal{O}}(\,\cdot\,)$ hide constants that are independent of $\alpha$ and $\eta$ but can depend on $g$ and $T$. ...

Figures (9)

  • Figure 1: When training ResNet-152 and ViT-B on ImageNet with cosine learning rate decay, Local SGD/AdamW with QSR consistently outperforms data parallel methods or Local SGD/AdamW with other synchronization strategies in terms of top-1 validation accuracy, while only requiring 20.1% and 10.4% of the communication volume, respectively. With QSR, Local SGD on ResNet or Local AdamW on ViT cuts the training time from 20.7 to 18 hours or 26.7 to 20.2 hours on 16 GPUs, when compared with data parallel methods. We report the mean and the standard deviation over 3 runs. See the experimental details section for training details.
  • Figure 2: Empirical results on Local SGD and Local AdamW validate the generalization performance order predicted by our theory: QSR $>$ $\{H \sim \eta^{-1}\}$ $>$ {constant $H$}. For SGD, we additionally have {constant $H$} $\approx$ {parallel SGD} since the latter is equivalent to Local SGD with $H=1$. Here, $\alpha$ and $\beta$ are tuned to maximize the test accuracy of QSR and $H\sim \eta^{-1}$, respectively.
  • Figure 3: For linear decay, QSR improves the test accuracy of Local AdamW on ViT-B, even outperforming the communication-intensive parallel AdamW.
  • Figure 4: A visualization of the learning rate schedules we investigate.
  • Figure 5: A visualization of the $H$ schedule for Local AdamW with a constant synchronization period $H=4$ and with QSR (${H}_{\mathrm{base}}=4$, $\alpha=0.0175$). The corresponding learning rate schedule is cosine decay with a peak learning rate of $0.008$. Adopting QSR improves the top-1 validation accuracy of Local AdamW on ViT-B from $79.32\%$ to $80.98\%$.
  • ...and 4 more figures
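
As a worked illustration of the Figure 5 setting ($H_{\mathrm{base}}=4$, $\alpha=0.0175$, peak learning rate $0.008$), the rule $H = \max\{H_{\mathrm{base}}, \lfloor (\alpha/\eta)^2 \rfloor\}$ can be evaluated at a few learning rates. The peak value comes from the caption; the two smaller learning rates are arbitrary points along the decay, chosen here only for illustration.

```latex
\begin{align*}
\eta = 0.008:\quad & \left(\tfrac{0.0175}{0.008}\right)^2 \approx 4.79,
  & H &= \max\{4, 4\} = 4, \\
\eta = 0.004:\quad & \left(\tfrac{0.0175}{0.004}\right)^2 \approx 19.14,
  & H &= \max\{4, 19\} = 19, \\
\eta = 0.001:\quad & \left(\tfrac{0.0175}{0.001}\right)^2 = 306.25,
  & H &= \max\{4, 306\} = 306.
\end{align*}
```

With these values the period first exceeds $H_{\mathrm{base}}$ once $(\alpha/\eta)^2 \ge 5$, that is, once $\eta \le \alpha/\sqrt{5} \approx 0.0078$, so the QSR schedule matches the constant $H=4$ schedule only very near the peak learning rate and diverges from it for the rest of training.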

Theorems & Definitions (18)

  • Definition 3.1: Slow SDE for SGD (informal; li2021happens, gu2023why)
  • Definition 3.2: Slow SDE for Local SGD with $H\sim \eta^{-1}$ (informal; gu2023why)
  • Definition 3.3: Slow SDE for Local SGD with QSR
  • Theorem 3.1: Weak Approximations
  • Definition E.1: Gradient Flow Projection
  • Definition E.2: Slow SDE for SGD, formal
  • Definition E.3: Slow SDE for Local SGD with $H \sim \eta^{-1}$, formal
  • Theorem E.1
  • proof
  • Lemma E.1
  • ...and 8 more