The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks

Khoat Than

TL;DR

The paper addresses the lack of a theoretical account of why normalization helps deep networks optimize and generalize. It develops a Lipschitz-based framework showing that the Lipschitz constant of an unnormalized DNN can grow exponentially with depth, which makes the model highly expressive but unstable. Inserting normalization layers can reduce this constant at a rate exponential in the number of normalizers, especially when input variances are large, which smooths optimization and constrains capacity. The paper proves that multiple normalizers smooth the loss at an exponential rate, and it presents a generalization bound based on local Lipschitz continuity that explains improved performance on unseen data even when global Lipschitz continuity fails. Empirically, input variances and weight norms are observed to increase during training, consistent with the theory, and BN/LN/GN are reported to give substantial stabilization and generalization benefits in standard architectures such as ResNet and EfficientNet. Overall, the work offers a unified explanation for the empirical success of normalization in deep learning, together with a framework for designing normalization schemes with capacity control and optimization dynamics in mind.

Abstract

Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and improve generalization. However, the underlying theoretical mechanism by which normalization contributes to both optimization and generalization remains largely unexplained, especially when many normalization layers are used in a DNN architecture. In this work, we develop a theoretical framework that elucidates the role of normalization through the lens of capacity control. We prove that an unnormalized DNN can exhibit exponentially large Lipschitz constants with respect to either its parameters or its inputs, implying excessive functional capacity and potential overfitting. There are uncountably many such ill-behaved DNNs. In contrast, inserting normalization layers can provably reduce the Lipschitz constant at an exponential rate in the number of normalization operations. This exponential reduction has two fundamental consequences: (1) it smooths the loss landscape at an exponential rate, facilitating faster and more stable optimization; and (2) it constrains the effective capacity of the network, thereby enhancing generalization guarantees on unseen data. Our results thus offer a principled explanation for the empirical success of normalization methods in deep learning.
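
The exponential behaviour described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's theorem: it uses the textbook upper bound for a feed-forward network with 1-Lipschitz activations (the product of per-layer operator norms) and assumes that a normalizer with fixed standard deviation sigma contributes a factor of 1/sigma. The function names and the numbers are hypothetical.

```python
import math

def lipschitz_upper_bound(weight_norms):
    """Product of per-layer operator norms: a standard upper bound on the
    Lipschitz constant of a net with 1-Lipschitz activations such as ReLU."""
    return math.prod(weight_norms)

def normalized_upper_bound(weight_norms, sigmas):
    """Same bound when layer i is followed by a normalizer.
    Assumption: each normalizer rescales by 1 / sigma_i (fixed statistics)."""
    return math.prod(w / s for w, s in zip(weight_norms, sigmas))

depth = 10
weight_norms = [2.0] * depth   # hypothetical: every layer has operator norm 2
sigmas = [4.0] * depth         # hypothetical: every normalizer sees std 4

plain = lipschitz_upper_bound(weight_norms)             # grows like 2**depth
normed = normalized_upper_bound(weight_norms, sigmas)   # shrinks like 0.5**depth

print(f"unnormalized bound: {plain:g}")
print(f"normalized bound:   {normed:g}")
```

With these numbers the unnormalized bound doubles with every extra layer, while each normalizer halves the bound again, so the gap between the two widens exponentially in the number of normalizers whenever the input standard deviations exceed the weight norms.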

Paper Structure

This paper contains 28 sections, 18 theorems, 40 equations, 3 figures.

Key Result

Lemma 1

Given $\epsilon > 0$, let ${\bm{x}} = (x_1, \ldots, x_n)$ be an input and ${\mathsf{BN}}({\bm{x}}, \epsilon)$ be the normalization of ${\bm{x}}$, where each input $x_k$ with population variance $\sigma_{k}^2$ is normalized as (eq-BN-01). Then $\| {\mathsf{BN}} \|_{\mathrm{Lip}} = \|{1}/{\boldsymbol{\sigma}} \|$,
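
A small numerical check can make the flavour of Lemma 1 concrete. The excerpt does not show equation (eq-BN-01) or the norm used in the lemma, so the sketch below makes two assumptions: each coordinate is normalized as (x_k - mean_k) / sqrt(var_k + eps) with fixed population statistics, and distances are measured in the l2 norm. Under those assumptions the map is diagonal and affine, so its Lipschitz constant is its largest per-coordinate scale. All names and values are hypothetical.

```python
import math
import random

def batch_norm(x, mean, var, eps):
    """Normalize each coordinate with fixed population statistics.
    Assumed form: (x_k - mean_k) / sqrt(var_k + eps)."""
    return [(xk - mk) / math.sqrt(vk + eps) for xk, mk, vk in zip(x, mean, var)]

def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))

def bn_lipschitz_l2(var, eps):
    """l2 Lipschitz constant of the diagonal affine map: its largest scale."""
    return max(1.0 / math.sqrt(vk + eps) for vk in var)

def stretch_ratio(x, y, mean, var, eps):
    """||BN(x) - BN(y)|| / ||x - y|| for one pair of inputs."""
    out = [a - b for a, b in zip(batch_norm(x, mean, var, eps),
                                 batch_norm(y, mean, var, eps))]
    return l2_norm(out) / l2_norm([a - b for a, b in zip(x, y)])

mean = [0.5, -1.0, 2.0]
var = [0.25, 4.0, 9.0]   # hypothetical population variances
eps = 1e-5
lip = bn_lipschitz_l2(var, eps)

# Perturbing only the lowest-variance coordinate attains the constant.
attained_ratio = stretch_ratio([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], mean, var, eps)

# Random pairs of inputs never exceed it.
rng = random.Random(0)
max_random_ratio = max(
    stretch_ratio([rng.gauss(0, 1) for _ in var],
                  [rng.gauss(0, 1) for _ in var], mean, var, eps)
    for _ in range(1000)
)
```

The coordinate with the smallest variance dominates the constant, which is why large input variances make the normalizer strongly contractive and small ones make it expansive.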

Figures (3)

  • Figure 1: Training dynamics. The leftmost subfigures show the weight norm at each layer, the middle subfigures show the product of all weight norms, and the rightmost subfigures show the training accuracy. ResNet18 and a 10-layer ReLU network are trained on the CIFAR-10 dataset. For ResNet18, only the dynamics of the first 11 weight matrices are shown for clarity. Detailed settings can be found in the appendix on empirical evaluations (app-empirical-evaluations).
  • Figure 2: Evolution of input variances across layers in a 10-layer ReLU network and ResNet18 trained on the CIFAR-10 dataset. The input variance $(\sigma^2)$ is computed over mini-batches before each activation function (or before each ${\mathsf{BN}}$ layer in ResNet18). For ResNet18, only the dynamics of the first 11 ${\mathsf{BN}}$ layers are presented for clarity. Although both networks are initialized using He initialization, several layers exhibit a rapid increase in input variance during training. Consequently, the cumulative product of layer-wise variances grows approximately exponentially.
  • Figure 3: Evolution of weight norms and input variances for some layers in EfficientNet-B3 trained on the CIFAR-10 dataset. The input variance $(\sigma^2)$ is computed over mini-batches before each ${\mathsf{BN}}$ layer.
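
The quantities tracked in these figures (per-layer weight norms, their product, and the variance of the inputs to each activation over a mini-batch) can be measured along the following lines. This is a minimal pure-Python sketch on a toy ReLU network, not the paper's experimental code. The Frobenius norm, the pooling of all pre-activation entries into one variance per layer, and the network sizes are assumptions made for illustration.

```python
import math
import random
import statistics

def relu(v):
    return [max(0.0, t) for t in v]

def linear(W, v):
    return [sum(w * t for w, t in zip(row, v)) for row in W]

def he_init(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def frobenius_norm(W):
    return math.sqrt(sum(w * w for row in W for w in row))

def layer_statistics(weights, batch):
    """Return (weight_norms, pre_activation_variances) for one mini-batch."""
    norms, variances = [], []
    acts = batch
    for W in weights:
        pre = [linear(W, x) for x in acts]
        # Assumption: pool every pre-activation entry in the mini-batch into a
        # single population variance for the layer.
        variances.append(statistics.pvariance([t for row in pre for t in row]))
        norms.append(frobenius_norm(W))
        acts = [relu(row) for row in pre]
    return norms, variances

rng = random.Random(0)
width, depth, batch_size = 16, 10, 32   # hypothetical toy sizes
weights = [he_init(width, width, rng) for _ in range(depth)]
batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch_size)]

norms, variances = layer_statistics(weights, batch)
norm_product = math.prod(norms)          # middle panels of Figure 1
variance_product = math.prod(variances)  # cumulative product from Figure 2
```

Calling `layer_statistics` after every training step and logging `norm_product` and `variance_product` would give curves comparable in spirit to those described in the captions; here the statistics are taken once at initialization because no training loop is included.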

Theorems & Definitions (23)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Definition 4: DNN
  • Lemma 5: Upper bound
  • Theorem 6: Lower bound
  • Remark 7
  • Definition 8: Normalized DNN
  • Theorem 9
  • Corollary 10: DNN+BN
  • ...and 13 more