Variance Reduction in SGD by Distributed Importance Sampling
Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio
TL;DR
This work introduces ISSGD, a distributed training framework in which a set of workers searches in parallel for the most informative training examples while a central updater trains on examples selected by importance sampling. When the sampling proposal is proportional to the per-example gradient norm, the gradient estimate stays unbiased and its variance is minimized, which can speed up convergence and may reduce communication costs. The authors provide theory for the optimal importance-sampling proposal in both scalar and vector settings, practical methods for computing per-example gradient norms within a minibatch, and a distributed implementation built around an oracle and a master-worker architecture. Experiments on a permutation-invariant SVHN task show reduced gradient variance and faster training. As future work, the authors discuss extending the method to convolutional and recurrent architectures and integrating it with ASGD. Overall, the paper presents a data-centric approach to distributed deep learning that concentrates updates on the most informative examples.
Abstract
Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers searches for the most informative examples in parallel while a single worker updates the model on examples selected by importance sampling. This leads the model to update using an unbiased estimate of the gradient, which also has minimum variance when the sampling proposal is proportional to the L2-norm of the gradient. We show experimentally that this method reduces gradient variance even in a context where the cost of synchronization across machines cannot be ignored, and where the factors for importance sampling are not updated instantly across the training set.
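The unbiasedness and variance claims above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the per-example gradients are synthetic, the helper names (`estimator_stats`, `sample_estimate`) are hypothetical, and the gradient norms are assumed to be exact and up to date, whereas the paper works with norms that are refreshed with a delay. An example i is drawn with probability q_i proportional to its gradient L2-norm, and its gradient is rescaled by 1 / (N q_i) so that the expectation still equals the average gradient.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def estimator_stats(grads, probs):
    """Exact mean and trace of covariance of g_i / (N * q_i) with i ~ q."""
    n, dim = len(grads), len(grads[0])
    mean = [0.0] * dim
    second_moment = 0.0
    for g, q in zip(grads, probs):
        if q == 0.0:
            continue  # only safe when g is the zero vector
        est = [x / (n * q) for x in g]
        for d in range(dim):
            mean[d] += q * est[d]
        second_moment += q * sum(x * x for x in est)
    return mean, second_moment - sum(m * m for m in mean)


def sample_estimate(grads, probs, batch_size, rng):
    """Monte Carlo minibatch estimate using the reweighted gradients."""
    n = len(grads)
    idx = rng.choices(range(n), weights=probs, k=batch_size)
    return mean_vector([[x / (n * probs[i]) for x in grads[i]] for i in idx])


# Synthetic per-example gradients (an assumption for illustration):
# every tenth example has a much larger gradient than the rest.
rng = random.Random(0)
N = 100
grads = [[rng.gauss(0.0, 1.0) * (10.0 if i % 10 == 0 else 0.1) for _ in range(3)]
         for i in range(N)]
true_grad = mean_vector(grads)

uniform = [1.0 / N] * N
norms = [l2_norm(g) for g in grads]
proposal = [r / sum(norms) for r in norms]  # q_i proportional to ||g_i||

mean_uniform, var_uniform = estimator_stats(grads, uniform)
mean_is, var_is = estimator_stats(grads, proposal)
estimate = sample_estimate(grads, proposal, batch_size=32, rng=rng)

print("trace of covariance, uniform sampling:   ", var_uniform)
print("trace of covariance, importance sampling:", var_is)
```

Both estimators have the same mean (the true average gradient); only the trace of the covariance differs, and it is smaller under the norm-proportional proposal whenever the gradient norms are not all equal.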
