Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Hao Yu, Sen Yang, Shenghuo Zhu

TL;DR

This work analyzes why simple model averaging can match the convergence of parallel mini-batch SGD with far less communication in non-convex deep learning. It introduces Parallel Restarted SGD (PR-SGD), in which each of $N$ workers runs local SGD independently and the workers average their models every $I$ iterations, and proves an $O(1/\sqrt{NT})$ convergence rate over $T$ iterations with a reduced communication footprint, provided the averaging interval satisfies $I \le T^{1/4}/N^{3/4}$. The paper extends the framework to time-varying learning rates and to asynchronous, heterogeneous networks, showing the same rate under practical conditions. Empirical results on ResNet20/CIFAR-10 corroborate the theory, demonstrating speedups from fewer communication rounds while maintaining accuracy.
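To make the interval condition concrete, the following minimal sketch computes the largest integer averaging interval allowed by $I \le T^{1/4}/N^{3/4}$ and the resulting number of averaging rounds. The function names are hypothetical, and the sketch assumes the bound is applied literally with no hidden constants, which the summary above does not guarantee.

```python
import math


def max_averaging_interval(num_workers: int, num_iterations: int) -> int:
    """Largest integer I with I <= T**(1/4) / N**(3/4), never below 1.

    Assumption: the bound is used literally, ignoring any constants.
    """
    bound = num_iterations ** 0.25 / num_workers ** 0.75
    return max(1, math.floor(bound))


def communication_rounds(num_iterations: int, interval: int) -> int:
    """Number of averaging steps when models are averaged every `interval` iterations."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-num_iterations // interval)


if __name__ == "__main__":
    N, T = 4, 10**8
    I = max_averaging_interval(N, T)
    print(f"N={N}, T={T}: interval I={I}, rounds={communication_rounds(T, I)}")
```

With an interval of 1 the workers communicate at every iteration, so any larger admissible interval cuts the number of communication rounds by roughly that factor.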

Abstract

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients at a single server to obtain their average, and updates each worker's local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up of the training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained on parallel workers, is another common practice for distributed training of deep neural networks, dating back to (Zinkevich et al. 2010) and (McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, model averaging has significantly lower communication overhead. Extensive experimental work has verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
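The model-averaging scheme described above can be sketched on a toy problem. The code below is an illustrative simulation only, not the paper's implementation: the quadratic objective, the Gaussian gradient noise, and all function and parameter names are assumptions chosen to keep the example self-contained.

```python
import random


def noisy_gradient(x, target, rng, noise_std):
    """Stochastic gradient of 0.5 * ||x - target||^2 with additive Gaussian noise."""
    return [xi - ti + rng.gauss(0.0, noise_std) for xi, ti in zip(x, target)]


def average_models(models):
    """Coordinate-wise average of the workers' local models."""
    n = len(models)
    return [sum(coords) / n for coords in zip(*models)]


def pr_sgd(target, num_workers, num_iterations, interval, lr,
           noise_std=0.1, seed=0):
    """Simulate local SGD with periodic model averaging on a toy quadratic.

    Returns the final averaged model and the number of averaging rounds.
    """
    rng = random.Random(seed)
    models = [[0.0] * len(target) for _ in range(num_workers)]
    rounds = 0
    for t in range(1, num_iterations + 1):
        # Each worker takes one local SGD step without communicating.
        for i in range(num_workers):
            grad = noisy_gradient(models[i], target, rng, noise_std)
            models[i] = [x - lr * g for x, g in zip(models[i], grad)]
        # Every `interval` iterations, all workers restart from the average.
        if t % interval == 0:
            avg = average_models(models)
            models = [list(avg) for _ in range(num_workers)]
            rounds += 1
    return average_models(models), rounds


if __name__ == "__main__":
    x, rounds = pr_sgd([1.0, -2.0], num_workers=4, num_iterations=200,
                       interval=10, lr=0.1)
    print("averaged model:", x, "averaging rounds:", rounds)
```

Setting `interval=1` averages after every step; since all workers then start each step from the same point, averaging the updated models is equivalent to applying one update with the averaged gradient, which is the parallel mini-batch SGD update. Larger intervals trade communication rounds for some drift between the local models.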

Paper Structure

This paper contains 10 sections, 8 theorems, 47 equations, 4 figures, 3 algorithms.

Key Result

Lemma 1

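Under Assumption 1, Algorithm 1 ensures $\mathbb{E}[\|\overline{\mathbf{x}}^{t} - \mathbf{x}_{i}^{t}\|^{2}] \le 4\gamma^{2} I^{2} G^{2}$ for all $i$ and all $t$, where $\overline{\mathbf{x}}^{t}$ is the node average of the local solutions $\mathbf{x}_{i}^{t}$, $\gamma$ is the learning rate, $I$ is the averaging interval, and $G$ is the constant defined in Assumption 1.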

Figures (4)

  • Figure 1: An illustration of Algorithm 1 implemented in a $2$-worker heterogeneous network. Orange "syn" rectangles represent the procedures to compute the node average.
  • Figure 2: Left: A typical epoch of Algorithm 3 in a heterogeneous network with $2$ workers. A wider rectangle means the SGD iteration takes a longer wall-clock time. Right: Imagined extra SGD iterations with a $0$ stochastic gradient (in light blue rectangles) are added for the slow worker.
  • Figure 3: Training loss of ResNet20 over CIFAR10 on a machine with $8$ P100 GPUs. In all schemes, each worker uses a local batch size of $32$ and momentum $0.9$. The initial learning rate at each worker is $0.1$ and is divided by $10$ when the $8$ workers together have accessed $150$ epochs and $275$ epochs of training data.
  • Figure 4: Test accuracy of ResNet20 over CIFAR10 on a machine with $8$ P100 GPUs. In all schemes, each worker uses a local batch size of $32$ and momentum $0.9$. The initial learning rate at each worker is $0.1$ and is divided by $10$ when the $8$ workers together have accessed $150$ epochs and $275$ epochs of training data.
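
The step schedule described in the captions of Figures 3 and 4 can be written as a small function. This is a sketch with hypothetical names; it assumes the division by $10$ takes effect exactly when the jointly accessed epoch count reaches each milestone, which the captions do not spell out.

```python
def step_learning_rate(epoch, initial_lr=0.1, milestones=(150, 275), factor=10.0):
    """Learning rate after `epoch` epochs of training data accessed by all workers.

    Assumption: the rate is divided by `factor` once `epoch` reaches a milestone.
    """
    lr = initial_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor
    return lr


if __name__ == "__main__":
    for epoch in (0, 149, 150, 274, 275):
        print(epoch, step_learning_rate(epoch))
```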

Theorems & Definitions (17)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • Remark 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • ...and 7 more