Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Sergey Ioffe

TL;DR

The paper addresses the limitation of Batch Normalization (BN) when training with small or non-i.i.d. minibatches by introducing Batch Renormalization (BRN), which adds per-dimension correction factors, treated as constants during gradient computation, so that training and inference produce the same activations. This approach preserves BN's advantages (fast training and insensitivity to initialization) while improving performance in these challenging minibatch regimes. Empirical results on ImageNet with Inception-v3 show that BRN matches or slightly surpasses BN on standard minibatches and substantially improves training with small or biased minibatches, reducing overfitting to the minibatch distribution. BRN is easy to implement, keeps the forward pass consistent between training and inference, and is applicable to a wide range of architectures, with potential benefits for ResNets, GANs, and recurrent networks.

Abstract

Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches. At the same time, Batch Renormalization retains the benefits of batchnorm such as insensitivity to initialization and training efficiency.
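
The correction described above can be illustrated with a minimal pure-Python sketch of the training-time forward pass for a single activation dimension. The function names, the list-based interface, and the default clipping bounds `r_max` and `d_max` are assumptions made for illustration, not the author's code. In a real framework the factors `r` and `d` would be wrapped in a stop-gradient so that backpropagation treats them as constants, and the learned scale and shift would be applied afterwards.

```python
import math


def batch_renorm_1d(x, mu, sigma, r_max=3.0, d_max=5.0, eps=1e-5):
    """Training-time Batch Renorm forward pass for one activation dimension.

    x      -- list of floats: values of this activation across the minibatch
    mu     -- moving-average mean (the statistic used at inference)
    sigma  -- moving-average standard deviation (used at inference)
    r_max, d_max -- clipping bounds for the correction (illustrative defaults)
    Returns (x_hat, r, d).
    """
    m = len(x)
    mu_b = sum(x) / m
    var_b = sum((v - mu_b) ** 2 for v in x) / m
    sigma_b = math.sqrt(var_b + eps)

    # Correction factors. They are plain numbers here; an autodiff framework
    # would need a stop-gradient around them.
    r = min(max(sigma_b / sigma, 1.0 / r_max), r_max)
    d = min(max((mu_b - mu) / sigma, -d_max), d_max)

    # Normalize with minibatch statistics, then correct toward the
    # moving-average statistics.
    x_hat = [(v - mu_b) / sigma_b * r + d for v in x]
    return x_hat, r, d


def inference_norm_1d(x, mu, sigma):
    """Inference-time normalization using only the moving averages."""
    return [(v - mu) / sigma for v in x]


if __name__ == "__main__":
    batch = [1.0, 2.0, 3.0, 6.0]
    x_hat, r, d = batch_renorm_1d(batch, mu=2.5, sigma=2.0)
    print(x_hat)
    print(inference_norm_1d(batch, mu=2.5, sigma=2.0))
```

When neither factor is clipped, the corrected training output coincides with the inference output, since `(x - mu_b) / sigma_b * r + d` simplifies to `(x - mu) / sigma`. Setting `r_max=1` and `d_max=0` forces `r = 1` and `d = 0`, which reduces the sketch to ordinary batchnorm.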

Paper Structure

This paper contains 8 sections, 7 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Validation top-1 accuracy of Inception-v3 model with batchnorm and its Batch Renorm version, trained on 50 synchronized workers, each processing minibatches of size 32. The Batch Renorm model achieves a marginally higher validation accuracy.
  • Figure 2: Validation accuracy for models trained with either batchnorm or Batch Renorm, where normalization is performed for sets of 4 examples (but with the gradients aggregated over all $50\times 32$ examples processed by the 50 workers). Batch Renorm allows the model to train faster and achieve a higher accuracy, although normalizing sets of 32 examples performs better.
  • Figure 3: Validation accuracy when training on non-i.i.d. minibatches, obtained by sampling 2 images for each of 16 (out of total 1000) random labels. This distribution bias results not only in a low test accuracy, but also low accuracy on the training set, with an eventual drop. This indicates overfitting to the particular minibatch distribution, which is confirmed by the improvement when the test minibatches also contain 2 images per label, and batchnorm uses minibatch statistics $\mu_\mathcal{B}$, $\sigma_\mathcal{B}$ during inference. Accuracy improves further if batchnorm is applied separately to 2 halves of a training minibatch, making each of them more i.i.d. Finally, by using Batch Renorm, we are able to just train and evaluate normally, and achieve the same validation accuracy as we get for i.i.d. minibatches in Fig. 1.
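
The non-i.i.d. minibatch construction in Figure 3 (2 images for each of 16 randomly chosen labels, giving 32 examples) can be sketched as follows. The function name and the `images_by_label` dictionary layout are hypothetical and chosen only for illustration; the paper's actual input pipeline is not described here.

```python
import random


def sample_label_pairs_minibatch(images_by_label, num_labels=16, per_label=2,
                                 rng=random):
    """Build a label-biased minibatch: `per_label` images from each of
    `num_labels` randomly chosen labels.

    images_by_label -- dict mapping label -> list of image identifiers
                       (hypothetical layout)
    rng             -- the `random` module or a `random.Random` instance
    Returns a list of (image, label) pairs.
    """
    labels = rng.sample(sorted(images_by_label), num_labels)
    batch = []
    for label in labels:
        # Images within a label are drawn without replacement.
        for image in rng.sample(images_by_label[label], per_label):
            batch.append((image, label))
    return batch


if __name__ == "__main__":
    # Toy stand-in for a 1000-class dataset: 5 fake image ids per label.
    toy = {label: [f"img_{label}_{i}" for i in range(5)]
           for label in range(1000)}
    minibatch = sample_label_pairs_minibatch(toy, rng=random.Random(0))
    print(len(minibatch))
```

Because every label in such a minibatch appears exactly twice, the minibatch statistics carry information about the labels, which is the kind of dependence the figure attributes the overfitting to.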