Batch Normalization Decomposed
Ido Nachum, Marco Bondaschi, Michael Gastpar, Anatoly Khina
TL;DR
This work analyzes Batch Normalization beyond the rescaling (RS) component, focusing on how recentering (RC) and the non-linearity (NL) shape the geometry of the representation at initialization. Using a simplified indicative model and a set of theorems, it shows that RC combined with ReLU drives the batch toward a single cluster plus one outlier that breaks away from the cluster in an orthogonal direction. It also proves rank growth under random ReLU layers, as well as invariance of representations under RC+NL layers with random weights. The results formalize the observed clustering and angular behavior and provide stability guarantees for randomly initialized architectures, offering insight into BN initialization, architectural choices, and potential sparse, orthogonal representations.
Abstract
\emph{Batch normalization} is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: \emph{recentering} the mean of the representation to zero, \emph{rescaling} the variance of the representation to one, and finally applying a \emph{non-linearity}. Our work follows the work of Hadi Daneshmand, Amir Joudaki, and Francis Bach [NeurIPS~'21], which studied deep \emph{linear} neural networks with only the rescaling stage between layers at initialization. In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity. When these two components are present, we observe a curious behavior at initialization. Through the layers, the representation of the batch converges to a single cluster, except for one odd data point that breaks far away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometrical evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned~configuration.
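The three components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code: the function names (`bn_relu_layer`, `cosine_gram`), the Gaussian weight initialization, and the batch size, width, and depth are all assumptions chosen for illustration. The sketch pushes a random batch through a stack of randomly initialized layers, each applying recentering, rescaling, and ReLU, and then computes the pairwise cosine similarities of the final representations so that their geometry can be inspected.

```python
import numpy as np

def bn_relu_layer(x, w, recenter=True, rescale=True, eps=1e-5):
    """One randomly weighted layer followed by batch normalization and ReLU.

    x: (batch, d_in) representations; w: (d_in, d_out) weights.
    Statistics are taken over the batch axis, per feature.
    """
    h = x @ w
    if recenter:
        # Recentering: subtract the per-feature batch mean.
        h = h - h.mean(axis=0, keepdims=True)
    if rescale:
        # Rescaling: divide by the per-feature batch standard deviation.
        h = h / np.sqrt(h.var(axis=0, keepdims=True) + eps)
    # Non-linearity: ReLU.
    return np.maximum(h, 0.0)

def cosine_gram(x):
    """Pairwise cosine similarities between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    xn = x / np.maximum(norms, 1e-12)  # guard against all-zero rows
    return xn @ xn.T

# Hypothetical experiment settings (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
batch, width, depth = 8, 512, 50

x = rng.standard_normal((batch, width))
for _ in range(depth):
    w = rng.standard_normal((width, width)) / np.sqrt(width)
    x = bn_relu_layer(x, w)

gram = cosine_gram(x)
print(np.round(gram, 2))
```

In the printed matrix, entries near 1 indicate points that share a direction and entries near 0 indicate nearly orthogonal points, so one can check whether a configuration resembling a cluster plus a single orthogonal point appears. Setting `recenter=False` or `rescale=False` gives a rough way to compare the contribution of each component under the same assumptions.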
