Characterizing signal propagation to close the performance gap in unnormalized ResNets
Andrew Brock, Soham De, Samuel L. Smith
TL;DR
BatchNorm stabilizes training, but it makes models sensitive to batch size and can behave inconsistently across hardware. The authors introduce Normalizer-Free ResNets (NF-ResNets), which use Scaled Weight Standardization in place of normalization layers, and Signal Propagation Plots, a tool for analyzing the forward-pass signal. They derive nonlinearity-specific scaling constants, adapt the residual building blocks, and report competitive ImageNet performance for NF-ResNets and NF-RegNets across FLOP budgets, often outperforming BatchNorm in microbatch settings. The approach reduces batch dependence and allows very deep networks to be trained without activation normalization, which matters in practice across different hardware and training regimes.
Abstract
Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs. Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization. Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet.
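The weight reparameterization mentioned above can be illustrated with a minimal sketch. It assumes the Scaled Weight Standardization form in which each output unit's weights are centered and normalized over their fan-in N, then multiplied by a nonlinearity-specific gain gamma divided by sqrt(N). The function names, the `eps` stabilizer, and the use of plain Python lists instead of a deep learning framework are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def relu_gain():
    # Assumed gain for ReLU: sqrt(2 / (1 - 1/pi)). With this factor,
    # gain * relu(x) has unit variance when x is standard normal.
    return math.sqrt(2.0 / (1.0 - 1.0 / math.pi))


def scaled_weight_standardize(weights, gain, eps=1e-8):
    """Standardize each row (one output unit) over its fan-in.

    weights: list of rows, one row of fan-in weights per output unit.
    gain:    nonlinearity-specific constant (gamma).
    eps:     small stabilizer, a hypothetical choice for this sketch.
    """
    standardized = []
    for row in weights:
        fan_in = len(row)
        mean = sum(row) / fan_in
        var = sum((w - mean) ** 2 for w in row) / fan_in
        # Each row ends up with zero mean and variance gain**2 / fan_in.
        scale = gain / math.sqrt(fan_in * (var + eps))
        standardized.append([(w - mean) * scale for w in row])
    return standardized


# Usage: standardize a random 4 x 16 weight matrix (4 output units).
random.seed(0)
raw = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
w_hat = scaled_weight_standardize(raw, relu_gain())
```

Because every row of the standardized matrix sums to zero, the output of each unit has zero mean over a constant input shift, which is one way to see why per-channel activation means do not accumulate with depth in this setup.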
