Variance Reduction in SGD by Distributed Importance Sampling

Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio

TL;DR

This work introduces ISSGD, a distributed training framework in which separate workers search in parallel for the most informative training examples while a central updater trains on samples selected via importance sampling. By making the sampling distribution proportional to the gradient norm, ISSGD obtains an unbiased gradient estimate with reduced variance, which can speed up convergence. The authors provide theory for optimal importance-sampling proposals in both scalar and vector settings, practical minibatch gradient-norm computations, and a distributed implementation with an oracle and a master-worker architecture. Experiments on a permutation-invariant SVHN task show significant variance reduction and faster training. The authors also discuss future work on extending the method to convolutional and recurrent architectures and on integrating it with ASGD. Overall, the paper presents a data-centric approach to distributed deep learning that accelerates training by focusing on the most informative examples.

Abstract

Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers search for the most informative examples in parallel while a single worker updates the model on examples selected by importance sampling. This leads the model to update using an unbiased estimate of the gradient which also has minimum variance when the sampling proposal is proportional to the L2-norm of the gradient. We show experimentally that this method reduces gradient variance even in a context where the cost of synchronization across machines cannot be ignored, and where the factors for importance sampling are not updated instantly across the training set.
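The central claim of the abstract, that reweighting importance-sampled gradients keeps the estimate unbiased, can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy least-squares loss, the function names, and the fact that gradient norms are computed exactly and instantly rather than by remote workers with stale parameters.

```python
import numpy as np

def per_example_gradients(w, X, y):
    """Gradient of 0.5 * (x_i . w - y_i)^2 w.r.t. w, one row per example."""
    residual = X @ w - y                 # shape (N,)
    return residual[:, None] * X         # shape (N, d)

def gradient_norm_proposal(grads):
    """Proposal q_i proportional to the L2-norm of each per-example gradient.

    Assumes at least one gradient is nonzero; examples with a zero gradient
    get probability zero and are never drawn.
    """
    norms = np.linalg.norm(grads, axis=1)
    return norms / norms.sum()

def expected_estimate(grads, q):
    """Exact expectation over i ~ q of the reweighted gradient g_i / (N q_i).

    Assumes q_i > 0 for every example.
    """
    n = grads.shape[0]
    reweighted = grads / (n * q)[:, None]
    return (q[:, None] * reweighted).sum(axis=0)

def sample_estimate(grads, q, batch_size, rng):
    """One minibatch estimate: draw indices from q, reweight, then average."""
    n = grads.shape[0]
    idx = rng.choice(n, size=batch_size, p=q)
    return (grads[idx] / (n * q[idx])[:, None]).mean(axis=0)

# Hypothetical toy data, used only to exercise the functions above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.normal(size=200)
w = np.zeros(4)

grads = per_example_gradients(w, X, y)
q = gradient_norm_proposal(grads)
print("true mean gradient :", grads.mean(axis=0))
print("expected IS value  :", expected_estimate(grads, q))
print("one IS minibatch   :", sample_estimate(grads, q, 32, rng))
```

The factor `1 / (N q_i)` is what keeps the estimate unbiased: each example's higher chance of being drawn is cancelled by a proportionally smaller weight.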

Paper Structure

This paper contains 23 sections, 6 theorems, 38 equations, 4 figures, 1 table.

Key Result

Theorem 1

Let $\mathcal{X}$ be a random variable in $\mathbb{R}^{d_1}$ and $f(x)$ be any function from $\mathbb{R}^{d_1}$ to $\mathbb{R}^{d_2}$. Let $p(x)$ be the probability density function of $\mathcal{X}$, and let $q(x)$ be a valid proposal distribution for importance sampling with the goal of estimating $\mathbb{E}_{p}\left[f(x)\right]$. The context requires that $q(x) > 0$ whenever $p(x) > 0$. We know that the importance sampling esti
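The theorem statement is cut off in this listing. As a reminder of the standard setup it builds on, and not as a reconstruction of the missing text, the importance-sampling identity and the trace of the estimator's covariance can be written as follows. The symbols $\mu$ and $\Sigma(q)$ are used here for the mean and covariance of the reweighted estimator. The form of $q^{*}$ matches the abstract's statement that variance is minimized when the proposal is proportional to the L2-norm of the gradient.

```latex
\begin{align*}
\mu &= \mathbb{E}_{p}\left[f(x)\right]
     = \int p(x)\, f(x)\, dx
     = \mathbb{E}_{q}\left[\frac{p(x)}{q(x)}\, f(x)\right], \\
\operatorname{Tr}\left(\Sigma(q)\right)
    &= \mathbb{E}_{q}\left[\left\| \frac{p(x)}{q(x)}\, f(x) \right\|_2^2\right]
       - \left\| \mu \right\|_2^2, \\
q^{*}(x) &= \frac{p(x)\, \left\| f(x) \right\|_2}{Z},
    \qquad Z = \int p(x)\, \left\| f(x) \right\|_2\, dx
             = \mathbb{E}_{p}\left[\left\| f(x) \right\|_2\right], \\
\operatorname{Tr}\left(\Sigma(q^{*})\right)
    &= \left(\mathbb{E}_{p}\left[\left\| f(x) \right\|_2\right]\right)^2
       - \left\| \mu \right\|_2^2 .
\end{align*}
```

The last line follows by substituting $q^{*}$ into the second: the integrand $p(x)^2 \|f(x)\|_2^2 / q^{*}(x)$ reduces to $Z\, p(x)\, \|f(x)\|_2$, which integrates to $Z^2$.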

Figures (4)

  • Figure 1: The actual distributed training experiment that we run relies on 3 kinds of actors. We have one master process that is running ISSGD. We have one database process in charge of storing and exchanging all kinds of measurements, as well as the parameters when they are communicated by the master to the workers. We have multiple worker processes, each with one GPU, in charge of evaluating the quantities necessary for the master to do importance sampling. The master has to read the model parameters from the GPU before sending them to the database, and the workers also have to load them onto the GPU after receiving them. The horizontal dotted lines represent synchronization barriers that we can enforce to have an exact method, or that we can drop to have faster training in practice.
  • Figure 2: Here in the top two plots we compare the training loss optimized with two sets of hyperparameters. On the top-left we use a higher learning rate, but also a higher smoothing of the importance weights to stabilize the algorithm. The quantities shown in the two top plots are the ones actually minimized by our procedure. We can see that, in both cases, ISSGD minimizes the loss more quickly than regular SGD, and it actually reaches 0.0. Those results are the median quantities reported over 50 runs for each set of hyperparameters, each run using a different random initialization. We also show quartiles 1 and 3 in thinner lines to give an idea of the distributions. In the two bottom plots we also report the prediction error on the training set for each method. Note the different time scale between the left and the right.
  • Figure 3: Here we report the prediction error on the test set. Just like in Figure 2, we report the median results over 50 runs with the same two sets of hyperparameters. In a fairly consistent way, we have that one setup has a better generalization error for ISSGD (on the left plot), and the opposite happens in the other scenario (right plot). We believe that this can be explained by ISSGD converging quickly to a configuration that minimizes the loss perfectly, after which it just gives up trying to do better. Regular SGD, on plot (b), would appear to experience some kind of regularization due to its variance, and it would continue to optimize over the course of 6 hours instead of only one hour (as shown in Figure 2(b)).
  • Figure 4: Square root of trace of covariance for different proposals $q$. We show here the median results aggregated over 50 runs of ISSGD. These plots come from the same hyperparameters used for Figure 2. On the left plot, we use a higher learning rate in the hopes of making convergence faster. This required the probability weights to be smoothed by adding a constant (+10.0) to all the probability weights, and this washed away a part of the variance-reduction benefits of using ISSGD. On the right plot, we used a smaller learning rate, and we still got comparably fast convergence. However, because of the additive constant +1.0 used, these runs were closer to the ideal ISSGD setting. The point of these plots is to show that with ISSGD we can obtain a smaller measurement of $\mathop{\mathrm{Tr}}\nolimits(\Sigma(q))$. This happens clearly on the right plot, but not as convincingly on the left.
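
The effect described in the Figure 4 caption, where a larger additive smoothing constant pushes the proposal toward uniform sampling and gives back part of the variance reduction, can be reproduced on synthetic data. The sketch below is illustrative only. The synthetic gradients, the function names, and the choice to add the constant to unnormalized gradient norms are assumptions; the constants 1.0 and 10.0 are borrowed from the caption but act on a different scale here than in the paper's experiments.

```python
import numpy as np

def smoothed_proposal(grad_norms, constant):
    """Proposal proportional to (gradient norm + additive smoothing constant)."""
    weights = grad_norms + constant
    return weights / weights.sum()

def trace_of_covariance(grads, q):
    """Exact Tr(Sigma(q)) for the one-sample estimator g_i / (N q_i), i ~ q.

    Assumes q_i > 0 wherever the gradient is nonzero.
    """
    n = grads.shape[0]
    mean_grad = grads.mean(axis=0)
    squared_norms = np.sum(grads ** 2, axis=1)
    second_moment = np.sum(squared_norms / (n ** 2 * q))
    return float(second_moment - mean_grad @ mean_grad)

# Hypothetical per-example gradients with widely varying magnitudes.
rng = np.random.default_rng(0)
n_examples, dim = 1000, 5
scales = rng.lognormal(mean=0.0, sigma=1.5, size=n_examples)
grads = scales[:, None] * rng.normal(size=(n_examples, dim))
norms = np.linalg.norm(grads, axis=1)

proposals = {
    "uniform (regular SGD)": np.full(n_examples, 1.0 / n_examples),
    "norm + 10.0": smoothed_proposal(norms, 10.0),
    "norm + 1.0": smoothed_proposal(norms, 1.0),
    "norm only (ideal)": smoothed_proposal(norms, 0.0),
}
for name, q in proposals.items():
    print(f"{name:>22s}: sqrt(Tr(Sigma(q))) = "
          f"{np.sqrt(trace_of_covariance(grads, q)):.4f}")
```

In this setup the trace is smallest for the unsmoothed proposal and grows as the constant increases, approaching the uniform-sampling value. That is one way to read the weaker variance reduction on the left plot.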

Theorems & Definitions (12)

  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • ...and 2 more