Multiple Importance Sampling for Stochastic Gradient Estimation
Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy J. Mitra, Gurprit Singh
TL;DR
This work tackles high-variance gradient estimation in SGD by introducing a self-adaptive importance sampling framework that dynamically evolves the sampling distribution during training. It extends importance sampling to vector-valued gradient estimation through multiple importance sampling (MIS) and optimal MIS (OMIS), which weight gradient contributions jointly across multiple distributions without resampling. The authors propose practical algorithms (IS and OMIS) with momentum-based stabilization and gradient-based importance functions, achieving faster convergence on classification, regression, and point-cloud tasks. The approach yields improved gradient estimates with manageable overhead and, in controlled experiments, approaches or matches exact-gradient performance, suggesting broad applicability for efficient optimization in neural networks.
Abstract
We introduce a theoretical and practical framework for efficient importance sampling of mini-batch samples for gradient estimation from single and multiple probability distributions. To handle noisy gradients, our framework dynamically evolves the importance distribution during training by utilizing a self-adaptive metric. Our framework combines multiple, diverse sampling distributions, each tailored to specific parameter gradients. This approach facilitates the importance sampling of vector-valued gradient estimation. Rather than naively combining multiple distributions, our framework optimally weights data contributions across multiple distributions. This adaptive combination of multiple importance distributions yields superior gradient estimates, leading to faster training convergence. We demonstrate the effectiveness of our approach through empirical evaluations across a range of optimization tasks, such as classification and regression, on both image and point cloud datasets.
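
To make the idea of importance-sampled mini-batch gradients concrete, here is a minimal NumPy sketch. It is not the authors' algorithm. It assumes a toy least-squares loss, uses the per-sample gradient norm as the importance function, and combines two distributions with the standard balance heuristic rather than the optimal MIS weights the paper proposes. All function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def per_sample_gradients(w, X, y):
    """Gradient of 0.5 * (x_i . w - y_i)^2 w.r.t. w, one row per sample (toy loss, assumed)."""
    residual = X @ w - y              # shape (N,)
    return residual[:, None] * X      # shape (N, D)

def importance_distribution(grads, eps=1e-12):
    """Probabilities proportional to per-sample gradient norm (an assumed importance function)."""
    norms = np.linalg.norm(grads, axis=1) + eps
    return norms / norms.sum()

def is_gradient_estimate(grads, p, batch_size, rng):
    """Single-distribution IS: sample indices from p, reweight each by 1 / (N * p_i)."""
    n = grads.shape[0]
    idx = rng.choice(n, size=batch_size, p=p)
    weights = 1.0 / (n * p[idx])
    return (weights[:, None] * grads[idx]).mean(axis=0)

def balance_heuristic_estimate(grads, dists, counts, rng):
    """MIS with the balance heuristic (not the paper's optimal weighting).

    Each sample i contributes g_i / (N * sum_j n_j * p_j(i)), which keeps the
    combined estimator unbiased as long as the mixture is positive everywhere.
    """
    n = grads.shape[0]
    total = sum(counts)
    mixture = sum(c * p for c, p in zip(counts, dists)) / total
    estimate = np.zeros(grads.shape[1])
    for c, p in zip(counts, dists):
        idx = rng.choice(n, size=c, p=p)
        estimate += (grads[idx] / (n * mixture[idx, None])).sum(axis=0)
    return estimate / total

# Small usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = rng.normal(size=64)
grads = per_sample_gradients(np.zeros(3), X, y)

p_norm = importance_distribution(grads)
p_uniform = np.full(64, 1.0 / 64)

exact = grads.mean(axis=0)
is_est = is_gradient_estimate(grads, p_norm, 32, rng)
mis_est = balance_heuristic_estimate(grads, [p_uniform, p_norm], [16, 16], rng)
print("exact:", exact)
print("IS   :", is_est)
print("MIS  :", mis_est)
```

In this sketch the per-sample gradients are computed exactly for every data point, which would defeat the purpose in real training. A practical method needs a cheap, evolving estimate of sample importance, which is the role of the self-adaptive metric described above.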
