Not All Samples Are Created Equal: Deep Learning with Importance Sampling

Angelos Katharopoulos, François Fleuret

TL;DR

Deep neural network training spends substantial computation on uninformative samples. The authors introduce an importance sampling method built on a tractable upper bound of the per-sample gradient norm, which reduces the variance of the stochastic gradients. They pair it with an estimator of the achieved variance reduction, so that importance sampling is switched on only when it yields an actual speedup. The bound is inexpensive to compute via a forward pass and is combined with a pre-sampling strategy to estimate the impact on variance and wall-clock time. Across image classification, CNN fine-tuning, and RNN sequence modeling, for a fixed wall-clock budget, the method reduces training loss by up to an order of magnitude and improves test error by a relative 5% to 17%, and it outperforms both uniform and loss-based sampling.

Abstract

Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on "informative" examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%.
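
The resampling step that the abstract alludes to can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a per-sample score is already available for each of the `B` pre-sampled examples (for instance an upper bound on the gradient norm obtained from a forward pass), and the function name `importance_resample` and its parameters are hypothetical. It draws `b` examples with probability proportional to the score and returns the weights `1 / (B * p_i)` that keep the weighted gradient estimate unbiased.

```python
import numpy as np


def importance_resample(scores, b, rng):
    """Draw b of the B pre-sampled examples with probability proportional to scores.

    `scores` is assumed to be a non-negative per-sample importance score
    (hypothetical input; e.g. a gradient-norm upper bound from a forward pass).
    Returns the chosen indices and the weights 1 / (B * p_i) that make the
    weighted average of per-sample gradients an unbiased estimate.
    """
    scores = np.asarray(scores, dtype=np.float64)
    B = scores.shape[0]
    total = scores.sum()
    if total <= 0.0:
        # Degenerate case: no informative sample, fall back to uniform sampling.
        p = np.full(B, 1.0 / B)
    else:
        p = scores / total
    idx = rng.choice(B, size=b, replace=True, p=p)
    weights = 1.0 / (B * p[idx])
    return idx, weights


rng = np.random.default_rng(0)
idx, weights = importance_resample([1.0, 3.0], b=4, rng=rng)
# In a training step, the loss of example idx[k] would be multiplied by
# weights[k] before averaging and back-propagating.
```

With scores `[1.0, 3.0]` the sampling probabilities are 0.25 and 0.75, so every draw of the first example carries weight 2 and every draw of the second carries weight 2/3. Uniform scores give weights of exactly 1, which recovers plain SGD.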

Paper Structure

This paper contains 22 sections, 27 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: The y-axis denotes the $L_2$ distance between the average gradient of the large batch ($G_B$) and the average gradient of the small batch ($G_b$), normalized by the distance achieved by uniform sampling. The small batch is sampled $10$ times and the reported results are the average. The details of the experimental setup are given in § \ref{sec:ablation}.
  • Figure 2: The probabilities generated with the loss and with our upper bound are plotted against the ideal probabilities produced by the gradient norm. The black line denotes perfect correlation. The details of the experimental setup are given in § \ref{sec:ablation}.
  • Figure 3: Comparison of importance sampling using the upper bound with uniform and loss-based importance sampling. The details of the training procedure are given in § \ref{sec:cifar}. Our proposed scheme is the only one achieving a speedup on CIFAR100 and results in 5% smaller test error. All presented results are averaged across $3$ independent runs.
  • Figure 4: Comparison of importance sampling for fine-tuning on the MIT67 dataset. The details of the training procedure are given in § \ref{sec:mit67}. Our proposed algorithm converges very quickly to $28.06\%$ test error in approximately half an hour, a relative reduction of $17\%$ compared with uniform sampling. For robustness, the results are averaged across $3$ independent runs.
  • Figure 5: Comparison of importance sampling on pixel-by-pixel MNIST with an LSTM. The details of the training procedure are given in § \ref{sec:mnist}. Our proposed algorithm speeds up training and achieves $7\%$ lower test error in one hour of training ($0.1055$ compared to $0.1139$). We observe that sampling proportionally to the loss actually hurts convergence in this case.
  • ...and 2 more figures
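
The quantity on the y-axis of Figure 1 can be reproduced in miniature as sketched below. This is an illustrative setup, not the paper's experiment: the per-sample gradients are a small synthetic array, and the function name `normalized_gradient_distance` and its parameters are assumptions. It computes the $L_2$ distance between $G_B$ and the importance-weighted $G_b$, averages it over several resamplings, and divides by the same quantity under uniform sampling.

```python
import numpy as np


def normalized_gradient_distance(per_sample_grads, probs, b, repeats, rng):
    """Mean ||G_B - G_b|| under `probs`, divided by the same under uniform sampling.

    per_sample_grads: array of shape (B, d), one gradient per pre-sampled example
    (synthetic here). probs: sampling distribution over the B examples.
    """
    grads = np.asarray(per_sample_grads, dtype=np.float64)
    B = grads.shape[0]
    G_B = grads.mean(axis=0)  # average gradient of the large batch

    def mean_distance(p):
        dists = []
        for _ in range(repeats):
            idx = rng.choice(B, size=b, replace=True, p=p)
            w = 1.0 / (B * p[idx])  # unbiasedness weights
            G_b = (w[:, None] * grads[idx]).mean(axis=0)
            dists.append(np.linalg.norm(G_B - G_b))
        return float(np.mean(dists))

    uniform = np.full(B, 1.0 / B)
    return mean_distance(np.asarray(probs, dtype=np.float64)) / mean_distance(uniform)


rng = np.random.default_rng(0)
grads = np.array([[1.0], [2.0], [3.0], [10.0]])  # synthetic 1-D gradients
norms = np.linalg.norm(grads, axis=1)
ratio = normalized_gradient_distance(grads, norms / norms.sum(), b=2, repeats=10, rng=rng)
```

Values below 1 mean the sampling distribution approximates $G_B$ better than uniform sampling does. In this toy case the gradients are positive scalars, so sampling proportionally to the gradient norm makes every weighted sample equal to $G_B$ and the ratio is zero up to rounding. Real, high-dimensional gradients would not reach zero.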