MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization

Wen Fei, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong

TL;DR

This paper proposes MimicNorm, a normalization method that improves convergence and efficiency in network training by simplifying BN regularization while keeping two fundamental effects of BN layers: data decorrelation and an adaptive learning rate.

Abstract

Extensive experiments have validated the success of the Batch Normalization (BN) layer in improving convergence and generalization. However, BN requires extra memory and floating-point computation. Moreover, BN is inaccurate on micro-batches, because it depends on batch statistics. In this paper, we address these problems by simplifying BN regularization while keeping two fundamental effects of BN layers: data decorrelation and an adaptive learning rate. We propose a novel normalization method, named MimicNorm, to improve convergence and efficiency in network training. MimicNorm consists of only two lightweight operations: a modified weight mean operation (subtracting the mean value from each weight parameter tensor) and one BN layer before the loss function (the last BN layer). We leverage neural tangent kernel (NTK) theory to prove that the weight mean operation whitens activations and moves the network into the chaotic regime, as a BN layer does, and consequently improves convergence. The last BN layer provides auto-tuned learning rates and also improves accuracy. Experimental results show that MimicNorm achieves accuracy similar to that of BN for various network structures, including ResNets and lightweight networks such as ShuffleNet, while reducing memory consumption by about 20%. The code is publicly available at https://github.com/Kid-key/MimicNorm.
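As a rough illustration of the weight mean operation described in the abstract, the sketch below subtracts the mean from each row of a small dense weight matrix and checks one consequence: once every row sums to zero, adding the same offset to all inputs no longer changes the output. The dense layer, the choice of one mean per output unit, and the helper names `center_rows` and `linear` are assumptions made for illustration; the authors' actual implementation lives in their repository and may differ.

```python
import random

def center_rows(weight):
    """Subtract each row's mean, so every output unit's weights sum to zero."""
    centered = []
    for row in weight:
        mean = sum(row) / len(row)
        centered.append([w - mean for w in row])
    return centered

def linear(weight, x):
    """Plain matrix-vector product, one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

random.seed(0)
fan_in, fan_out = 8, 4  # hypothetical layer size
weight = [[random.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(fan_out)]
centered = center_rows(weight)

# Non-negative inputs, as after a ReLU, plus a copy with a shared offset added.
x = [abs(random.gauss(0.0, 1.0)) for _ in range(fan_in)]
x_shifted = [v + 5.0 for v in x]

plain_gap = max(abs(a - b) for a, b in zip(linear(weight, x), linear(weight, x_shifted)))
centered_gap = max(abs(a - b) for a, b in zip(linear(centered, x), linear(centered, x_shifted)))

print(f"output change with raw weights:      {plain_gap:.6f}")
print(f"output change with centered weights: {centered_gap:.6f}")
```

With centered weights the shared offset cancels exactly (up to rounding), which gives some intuition for why removing the weight mean can reduce the common component that makes activations of different inputs look alike.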

Paper Structure

This paper contains 15 sections, 5 theorems, 22 equations, 10 figures, 3 tables.

Key Result

Theorem 1

[xiao2019disentangling] The limiting condition number $\kappa$ of the NTK falls into three regimes: 1) in the ordered phase with $\chi_1<1$, $\kappa$ diverges; 2) on the critical line with $\chi_1=1$, $\kappa$ grows linearly with the training set size, $\kappa\sim|\mathcal{X}|$; 3) the chaotic phase

Figures (10)

  • Figure 1: Illustrative diagram of MimicNorm. Modern neural networks are composed of stacked basic units and a loss function, usually cross-entropy for classification. To remove BN layers, we introduce a weight mean operation for each convolutional layer and modify the initialization to generate stable intermediate activations. In addition, we apply a learnable scalar multiplier for the skip connection to normalize the residual, and we insert one last BN layer before the loss function to adjust the learning rate.
  • Figure 2: A straight network correlates activations. We randomly select 30 channels in the 20th and 50th layers and show their distributions at the initial iteration. Because most values lie within the boxes, we conclude that intra-channel similarity and inter-channel variation increase with depth. In other words, different inputs generate similar activations. These correlated activations result from the frozen NNGP kernel.
  • Figure 3: The BN layer and the weight mean operation unfreeze the NNGP kernel. This figure plots activation distributions in the 50th layer with BN layers (left) and with the weight mean operation (right). Activations in each channel follow the same distribution.
  • Figure 4: Transition operators for networks with and without the weight mean operation. The operator $\mathcal{T}$ for the straight network (blue curve) has a stable fixed point at $\rho^*=1$, indicating correlated inputs and a frozen NNGP kernel. In contrast, the stable fixed point of the operator $\bar{\mathcal{T}}$ (orange curve) lies at $\rho^*=0$, and the weight mean operation moves the network into the chaotic regime, since the criterion $\chi_1=\dot{\bar{\mathcal{T}}}(1)>1$ is satisfied.
  • Figure 5: Batch Normalization suppresses the variance in the $l$-th residual branch by a factor of $1/l$.
  • ...and 5 more figures
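
The stable fixed point at $\rho^*=1$ mentioned in the Figure 4 caption can be illustrated numerically. The sketch below iterates the standard arc-cosine correlation map for a ReLU layer, which is assumed here as a stand-in for the straight-network operator $\mathcal{T}$ (zero-mean Gaussian weights, no bias, unit-variance pre-activations); the paper's exact operator may differ, and the function names `relu_correlation_map` and `iterate` are hypothetical.

```python
import math

def relu_correlation_map(rho):
    """Arc-cosine (degree 1) correlation map for one ReLU layer.

    Assumed stand-in for the straight-network operator, not the paper's code.
    """
    rho = max(-1.0, min(1.0, rho))  # guard acos against rounding error
    return (math.sqrt(1.0 - rho * rho) + (math.pi - math.acos(rho)) * rho) / math.pi

def iterate(rho0, depth):
    """Apply the map `depth` times and return the whole trajectory."""
    trajectory = [rho0]
    for _ in range(depth):
        trajectory.append(relu_correlation_map(trajectory[-1]))
    return trajectory

# Start from two uncorrelated inputs and follow their correlation through depth.
trajectory = iterate(0.0, 50)
for layer in (0, 1, 5, 20, 50):
    print(f"layer {layer:2d}: correlation = {trajectory[layer]:.4f}")
```

Under this assumed map the correlation between two initially uncorrelated inputs rises monotonically toward 1 with depth, consistent with the caption's description of correlated activations in the straight network. The weight-mean operator $\bar{\mathcal{T}}$ is not reproduced here because its form is not given in this summary.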

Theorems & Definitions (5)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Lemma 1
  • Theorem 3