Batch normalization does not improve initialization
Joris Dannemann, Gero Junike
TL;DR
This paper challenges the claim that Batch Normalization (BN) improves initialization. Using a minimal counterexample (a single-neuron network with identity activation and a specific 3-point dataset), the authors show that the BN loss can have a nonzero gradient at the standard local optimum $W^*$, which contradicts the claimed initialization advantage. The counterexample relies on BN's scale invariance, which produces a line of BN-optima along $3w_1 = 5w_2$ with $w_1 > 0$. The supposed inequality would then force $\hat{W}^* = W_0$, contradicting the nonzero gradient; consequently, BN does not improve initialization. The work emphasizes careful theoretical analysis of BN's effects beyond its empirical benefits.
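The scale invariance used in the counterexample can be checked numerically. The following is a minimal sketch, not the authors' construction: the 3-point batch is hypothetical (the paper's dataset is not reproduced here), BN is taken without an epsilon term or learned scale and shift, and the helper names `neuron`, `batch_norm`, and `bn_outputs` are introduced only for illustration. It shows that rescaling the weights by a positive factor leaves the BN output unchanged, so any BN loss is constant along a ray such as $3w_1 = 5w_2$, $w_1 > 0$.

```python
import math

def neuron(w, x):
    """Single neuron with identity activation: returns the dot product w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_norm(values):
    """Normalize a batch to zero mean and unit variance.

    Assumption: no epsilon and no learned scale/shift, and the batch
    has nonzero variance.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]

def bn_outputs(w, batch):
    """BN applied to the neuron's outputs over the whole batch."""
    return batch_norm([neuron(w, x) for x in batch])

# Hypothetical 3-point batch, NOT the dataset from the paper.
batch = [(1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]

w = (5.0, 3.0)                          # lies on 3*w1 == 5*w2 with w1 > 0
scaled = tuple(2.5 * wi for wi in w)    # another point on the same ray
flipped = tuple(-wi for wi in w)        # opposite ray (w1 < 0)

out_w = bn_outputs(w, batch)
out_scaled = bn_outputs(scaled, batch)
out_flipped = bn_outputs(flipped, batch)

# Positive rescaling leaves the BN output unchanged (up to rounding).
same_on_ray = all(abs(a - b) < 1e-9 for a, b in zip(out_w, out_scaled))
# Negating the weights flips the sign of the BN output.
sign_flips = all(abs(a + b) < 1e-9 for a, b in zip(out_w, out_flipped))
print(same_on_ray, sign_flips)
```

Because the output depends only on the direction of the weight vector, the sign flip under negation is why the condition $w_1 > 0$ singles out one of the two rays on the line.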
Abstract
Batch normalization is one of the most important regularization techniques for neural networks, significantly improving training by centering the layers of the neural network. There have been several attempts to provide a theoretical justification for batch normalization. Santurkar and Tsipras (2018) [How does batch normalization help optimization? Advances in neural information processing systems, 31] claim that batch normalization improves initialization. We provide a counterexample showing that this claim is not true, i.e., batch normalization does not improve initialization.
