Characterizing signal propagation to close the performance gap in unnormalized ResNets

Andrew Brock, Soham De, Samuel L. Smith

TL;DR

BatchNorm stabilizes training but makes models sensitive to batch size and can behave inconsistently across hardware. The authors present Normalizer-Free ResNets (NF-ResNets), which use Scaled Weight Standardization to keep the forward signal stable without normalization layers, and Signal Propagation Plots to analyze that signal. They derive nonlinearity-specific constants, adapt the residual building blocks, and demonstrate competitive ImageNet performance for NF-ResNets and NF-RegNets across FLOP budgets, often outperforming BatchNorm in microbatch settings. The approach reduces batch dependence and enables robust training of very deep networks without activation normalization, which matters for diverse hardware and training regimes.

Abstract

Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet.
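
As a rough illustration of the weight-level idea behind Scaled Weight Standardization, the sketch below standardizes each output unit's weights over its fan-in and rescales them by a nonlinearity-specific gain. The function name, the `eps` guard, and the list-of-rows layout are assumptions made for this example, not the authors' implementation. The ReLU gain $\sqrt{2/(1-1/\pi)}$ follows from the variance of a rectified unit Gaussian.

```python
import math
import random

# Gain that restores unit variance after ReLU on N(0, 1) inputs:
# Var[relu(x)] = (1/2) * (1 - 1/pi), so gamma = 1 / sqrt(Var[relu(x)]).
RELU_GAMMA = math.sqrt(2.0 / (1.0 - 1.0 / math.pi))


def scaled_weight_standardization(weights, gamma=RELU_GAMMA, eps=1e-10):
    """Return weights with zero-mean rows of variance ~gamma**2 / fan_in.

    `weights` is a list of rows, one row per output unit (hypothetical layout).
    `eps` is an assumed guard against division by zero.
    """
    standardized = []
    for row in weights:
        fan_in = len(row)
        mean = sum(row) / fan_in
        var = sum((w - mean) ** 2 for w in row) / fan_in
        scale = gamma / math.sqrt(fan_in * var + eps)
        standardized.append([(w - mean) * scale for w in row])
    return standardized


random.seed(0)
raw = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(4)]
w_hat = scaled_weight_standardization(raw)

# Each row now sums to ~0, so a constant positive input offset
# (such as the mean of ReLU outputs) contributes nothing to the output mean.
print([round(sum(row), 6) for row in w_hat])
```

Because every standardized row sums to approximately zero, a shared positive offset in the inputs cancels in the output, which is one way to read the abstract's statement that per-channel activation means do not grow with depth.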

Paper Structure

This paper contains 33 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Signal Propagation Plot for a ResNetV2-600 at initialization with BatchNorm, ReLU activations and He init, in response to an $\mathcal{N}(0, 1)$ input at 512px resolution. Black dots indicate the end of a stage. Blue plots use the BN-ReLU-Conv ordering while red plots use ReLU-BN-Conv.
  • Figure 2: SPPs for three different variants of the ResNetV2-600 network (with ReLU activations). In red, we show a batch normalized network with ReLU-BN-Conv ordering. In green we show a normalizer-free network with He-init and $\alpha = 1$. In cyan, we show the same normalizer-free network but with Scaled Weight Standardization. We note that the SPPs for a normalizer-free network with Scaled Weight Standardization are almost identical to those for the batch normalized network.
  • Figure 3: ImageNet Top-1 test accuracy versus FLOPs.
  • Figure 4: Residual Blocks for pre-activation ResNets (He et al., 2016). Note that some variants swap the order of the nonlinearity and the BatchNorm, resulting in signal propagation which is more similar to that of our normalizer-free networks.
  • Figure 5: Residual Blocks for post-activation (original) ResNets (He et al., 2016).
  • ...and 6 more figures
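
The Signal Propagation Plots described in the captions above track activation statistics with depth. A minimal sketch of the per-layer measurement follows, assuming the plotted quantities are the average squared channel mean and the average channel variance. The helper name `spp_stats` and the 2-D (examples by channels) layout are hypothetical, and this toy input is not the ResNetV2-600 used in the figures.

```python
import random


def spp_stats(activations):
    """Return (average squared channel mean, average channel variance).

    `activations` is a list of examples, each a list of channel values
    (a hypothetical 2-D layout; image tensors would also pool over space).
    """
    n = len(activations)
    channels = len(activations[0])
    sq_mean_total = 0.0
    var_total = 0.0
    for c in range(channels):
        column = [example[c] for example in activations]
        mean = sum(column) / n
        sq_mean_total += mean ** 2
        var_total += sum((v - mean) ** 2 for v in column) / n
    return sq_mean_total / channels, var_total / channels


random.seed(0)
# N(0, 1) input, mirroring the input distribution named in the Figure 1 caption.
x = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4000)]
relu_x = [[max(v, 0.0) for v in example] for example in x]

print("input:", spp_stats(x))
print("after ReLU:", spp_stats(relu_x))
```

On the Gaussian input the squared channel mean is close to zero and the variance close to one. After ReLU the squared mean moves toward $1/(2\pi)$ and the variance toward $\tfrac{1}{2}(1 - 1/\pi)$, the kind of mean shift that would be recorded layer by layer in a full plot.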