Variance Reduced Local SGD with Lower Communication Complexity

Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, Yifei Cheng

TL;DR

VRL-SGD targets the communication bottleneck in distributed non-convex optimization when workers hold non-identical data, by adding a variance-reduction mechanism to Local SGD. The algorithm augments each local update with a compensation term for the gradient variance among workers and averages models periodically, which lowers the communication complexity to $O(T^{1/2} N^{3/2})$ while preserving a linear iteration speedup. The theoretical analysis bounds the average gradient norm and shows reduced sensitivity to data heterogeneity; a warm-up variant suppresses the drift term $C$. On three standard ML tasks, VRL-SGD matches the convergence speed of S-SGD and outperforms Local SGD when data distributions differ across workers, which suggests practical value for federated and large-scale distributed training.
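The idea of a compensation term combined with periodic averaging can be illustrated on a toy problem. The sketch below is not taken from the paper's Algorithm 1, which this summary does not reproduce. It assumes a correction of the form `delta_i += (x_avg - x_i) / (k * gamma)` applied at each averaging step, exact (noise-free) gradients, and hypothetical scalar objectives $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$ with different $a_i$ and $b_i$ per worker. The function name and all parameter values are illustrative.

```python
def run_periodic_averaging(a, b, k, gamma, rounds, variance_reduced):
    """Minimise (1/N) * sum_i 0.5 * a[i] * (x - b[i])**2 with k local steps per round.

    Assumption: the correction rule below is an illustrative form of a
    variance-compensation term, not a transcription of the paper's algorithm.
    """
    n = len(a)
    x = [0.0] * n        # one scalar model copy per worker
    delta = [0.0] * n    # per-worker correction terms (they always sum to zero)
    x_avg = 0.0
    for _ in range(rounds):
        for _ in range(k):
            # Local step: exact local gradient a_i * (x_i - b_i), minus the correction.
            x = [xi - gamma * (ai * (xi - bi) - di)
                 for xi, ai, bi, di in zip(x, a, b, delta)]
        x_avg = sum(x) / n                       # communication: average the models
        if variance_reduced:
            # Accumulate how far each worker drifted from the average in this round.
            delta = [di + (x_avg - xi) / (k * gamma) for di, xi in zip(delta, x)]
        x = [x_avg] * n                          # every worker restarts from the average
    return x_avg


a, b = [1.0, 3.0], [-1.0, 1.0]                   # hypothetical non-identical workers
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)   # global minimiser
plain = run_periodic_averaging(a, b, k=10, gamma=0.003, rounds=2000, variance_reduced=False)
corrected = run_periodic_averaging(a, b, k=10, gamma=0.003, rounds=2000, variance_reduced=True)
print(abs(plain - x_star), abs(corrected - x_star))
```

With a fixed step size and $k > 1$, plain periodic averaging in this toy setting settles at a point slightly away from the global minimiser because each worker drifts toward its own optimum, while the corrected variant removes that drift. The values are chosen so that, with $L = 3$, the step size satisfies $\gamma \leq \frac{1}{2L}$ and $72k^2\gamma^2L^2 \leq 1$.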

Abstract

To accelerate the training of machine learning models, distributed stochastic gradient descent (SGD) and its variants, which run multiple workers in parallel, have been widely adopted. Among them, Local SGD has gained much attention due to its lower communication cost. Nevertheless, when the data distribution on workers is non-identical, Local SGD requires $O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its linear iteration speedup property, where $T$ is the total number of iterations and $N$ is the number of workers. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity. By eliminating the dependency on the gradient variance among workers, we prove that VRL-SGD achieves a linear iteration speedup with a lower communication complexity of $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets. Experiments on three machine learning tasks show that VRL-SGD performs significantly better than Local SGD when the data among workers are quite diverse.
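To get a feel for the two communication orders, the following sketch evaluates $T^{3/4} N^{3/4}$ and $T^{1/2} N^{3/2}$ directly. The constants hidden by the $O(\cdot)$ notation are dropped, so the numbers indicate growth rates only, not actual communication counts; the function names and the sample values of $T$ and $N$ are illustrative.

```python
def local_sgd_order(T, N):
    """Order of communications quoted for Local SGD with non-identical data."""
    return T ** 0.75 * N ** 0.75


def vrl_sgd_order(T, N):
    """Order of communications quoted for VRL-SGD."""
    return T ** 0.5 * N ** 1.5


def ratio(T, N):
    """VRL-SGD order divided by Local SGD order; algebraically (N**3 / T) ** 0.25."""
    return vrl_sgd_order(T, N) / local_sgd_order(T, N)


for T, N in [(10**4, 8), (10**6, 8), (10**6, 32)]:
    print(T, N, round(local_sgd_order(T, N)), round(vrl_sgd_order(T, N)), ratio(T, N))
```

Since the ratio simplifies to $(N^3 / T)^{1/4}$, the VRL-SGD order is the smaller of the two exactly when $T > N^3$, ignoring constants.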

Paper Structure

This paper contains 26 sections, 5 theorems, 51 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Theorem 5.1

Under the paper's assumptions, if the learning rate satisfies $\gamma \leq \frac{1}{2L}$ and $72k^2\gamma^2L^2\leq 1$, we have the following convergence result for VRL-SGD in Algorithm 1:

Figures (6)

  • Figure 1: Epoch loss for the non-identical case. VRL-SGD converges as fast as S-SGD, whereas Local SGD and EASGD converge slowly or fail to converge.
  • Figure 2: Epoch loss for the identical case. All of the algorithms have a similar convergence rate.
  • Figure 3: Logarithm of distance to the global minimum for different $b$ and communication period $k$.
  • Figure 4: Logarithm of variance among workers for different $b$ and communication period $k$.
  • Figure 5: Epoch loss for the non-identical case. We set $k=10$ for LeNet, $k=25$ for TextCNN and $k=10$ for Transfer Learning.
  • ...and 1 more figures

Theorems & Definitions (10)

  • Theorem 5.1
  • Corollary 5.2
  • Remark 5.3
  • Remark 5.4
  • Remark 5.5
  • Remark 5.6
  • Remark 5.7
  • Lemma 1
  • Lemma 2
  • Lemma 3