The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks
Khoat Than
TL;DR
The paper tackles the lack of theoretical understanding of why normalization helps deep networks generalize. It develops a Lipschitz-based framework showing that unnormalized DNNs can have Lipschitz constants that grow exponentially with depth, leading to highly expressive but unstable models, while inserting normalization layers reduces these constants exponentially with the number of normalizers, especially under large input variances, thereby smoothing optimization and constraining capacity. The authors prove that multiple normalizers yield exponential loss-smoothness and present a local Lipschitz generalization bound that explains improvement in unseen data performance, even when global Lipschitz continuity fails. Empirically, they observe increasing input variances and weight norms during training, consistent with their theory, and demonstrate the substantial stabilization and generalization benefits of BN/LN/GN in standard architectures like ResNet and EfficientNet. Overall, the work provides a principled, unified explanation for the empirical success of normalization in deep learning and offers a framework for designing normalization schemes that optimize capacity control and optimization dynamics.
Abstract
Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and improve generalization. However, the underlying theoretical mechanism by which normalization contributes to both optimization and generalization remains largely unexplained, especially when using many normalization layers in a DNN architecture. In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or inputs, implying excessive functional capacity and potential overfitting. Such bad DNNs are uncountably many. In contrast, the insertion of normalization layers provably can reduce the Lipschitz constant at an exponential rate in the number of normalization operations. This exponential reduction yields two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby enhancing generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.
