Gradient Diversity: a Key Ingredient for Scalable Distributed Learning

Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

TL;DR

The paper introduces gradient diversity to explain speedup saturation in distributed mini-batch SGD. It defines a data-dependent batch-size bound B_S(w) and proves convergence results for strongly convex, convex, smooth nonconvex, and Polyak-Łojasiewicz (PL) objectives, plus a worst-case lower bound showing that convergence slows down once the batch-size exceeds this bound. It extends the analysis to stability and generalization via differential gradient diversity, and discusses diversity-inducing mechanisms (DIMs) such as dropout, Langevin dynamics, and quantization. Experiments on logistic regression and CIFAR-10 neural networks indicate that higher gradient diversity permits larger batch-sizes while preserving convergence and generalization.

Abstract

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity that measures the dissimilarity between concurrent gradient updates, and show its key role in the performance of mini-batch SGD. We prove that on problems with high gradient diversity, mini-batch SGD is amenable to better speedups, while maintaining the generalization performance of serial (one sample) SGD. We further establish lower bounds on convergence where mini-batch SGD slows down beyond a particular batch-size, solely due to the lack of gradient diversity. We provide experimental evidence indicating the key role of gradient diversity in distributed learning, and discuss how heuristics like dropout, Langevin dynamics, and quantization can improve it.

Paper Structure

This paper contains 44 sections, 23 theorems, 119 equations, 3 figures, 1 table.

Key Result

Theorem 1

For generalized linear functions, $\forall~\mathbf{w}\in\mathcal{W}$, we have

Figures (3)

  • Figure 1: Speedup gains for a single data pass and various batch-sizes, for a cuda-convnet variant model on CIFAR-10.
  • Figure 3: Data replication. (a) Logistic regression with two classes of CIFAR-10 (b) Cuda convolutional neural network (c) Residual network. For (a), we plot the average loss ratio during all the iterations of the algorithm, and average over 10 experiments; for (b), (c), we plot the loss ratio as a function of the number of passes over the entire dataset, and average over 3 experiments. Step-sizes are tuned to get fastest convergence for each batch-size.
  • Figure 4: Stability. (a) Normalized Euclidean distance vs. number of data passes. (b) Generalization behavior of batch-size 512. (c) Generalization behavior of batch-size 1024. Results are averaged over 3 experiments.

Theorems & Definitions (31)

  • Definition 1: $\beta$-smooth
  • Definition 2: $\lambda$-strongly convex
  • Definition 3: $\mu$-Polyak-Łojasiewicz (PL)
  • Definition 4: Gradient Diversity
  • Definition 5: Batch-size Bound
  • Theorem 1
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • Lemma 1
  • ...and 21 more
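
Definitions 4 and 5 listed above can be illustrated numerically. The following is a minimal sketch, not the authors' code. It assumes gradient diversity is the ratio of the sum of squared per-example gradient norms to the squared norm of the summed gradient, and that the batch-size bound is n times that ratio. Neither formula is reproduced in this summary, so both are assumptions here, and the function names `gradient_diversity` and `batch_size_bound` are hypothetical.

```python
import numpy as np


def gradient_diversity(grads):
    """Assumed form: sum_i ||g_i||^2 / ||sum_i g_i||^2.

    grads: array-like of shape (n, d), one per-example gradient per row.
    """
    grads = np.asarray(grads, dtype=float)
    sum_sq_norms = np.sum(grads ** 2)                 # sum of squared norms
    sq_norm_of_sum = np.sum(grads.sum(axis=0) ** 2)   # squared norm of the sum
    if sq_norm_of_sum == 0.0:
        # Gradients cancel exactly; the ratio is unbounded.
        return float("inf")
    return sum_sq_norms / sq_norm_of_sum


def batch_size_bound(grads):
    """Assumed form: n times the gradient diversity."""
    grads = np.asarray(grads, dtype=float)
    return grads.shape[0] * gradient_diversity(grads)


# Four identical gradients: no diversity, so the bound collapses to 1.
identical = np.tile([1.0, 2.0], (4, 1))
# Four mutually orthogonal gradients: the bound equals n = 4.
orthogonal = np.eye(4)

print(batch_size_bound(identical))   # 1.0
print(batch_size_bound(orthogonal))  # 4.0
```

Under these assumptions the two extremes match the intuition in the abstract: identical concurrent gradients give no benefit from batching beyond a single sample, while orthogonal gradients allow the batch to grow to the full sample count.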