On the Nonlinearity of Layer Normalization
Yunhao Ni, Yuxin Guo, Junlong Jia, Lei Huang
TL;DR
This work provides a theoretical and empirical study of the nonlinearity of Layer Normalization (LN). By introducing LN-Net, consisting of layerwise linear maps interleaved with LN, and the SSR/LSSR framework, the authors show that LN induces nonlinearity and that LN-Nets can achieve strong representation capacity, even with width as small as 3 and depth linear in the number of samples. They derive a constructive approach to classify arbitrary labelings and establish VC-dimension lower bounds, while also proposing group-based LN (LN-G) to amplify nonlinearity. Empirically, LN-G enhances capacity on random-label tasks and yields practical improvements in CNNs and Transformers, suggesting LN-G as a design principle for future architectures. The results motivate reevaluating normalization as a potential source of nonlinearity and expressive power in deep networks.
Abstract
Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains limited. This paper investigates a new theoretical direction for LN: its nonlinearity and representation capacity. We study the representation capacity of a network composed layerwise of linear and LN transformations, referred to as LN-Net. We theoretically show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them. We further derive a lower bound on the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which we demonstrate theoretically under a mild assumption and support empirically with experiments. Based on our analyses, we consider designing neural architectures that exploit and amplify the nonlinearity of LN, and our experiments support their effectiveness.
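To make the objects in the abstract concrete, the following is a minimal sketch of an LN-Net forward pass in plain Python. It is an illustration under stated assumptions, not the authors' implementation: LN is taken without learnable scale and shift, the group variant is assumed to apply LN independently to contiguous, equally sized groups of neurons, and the names `layer_norm`, `group_layer_norm`, `linear`, and `ln_net` are hypothetical. The example also checks that LN is not homogeneous, which is one simple way to see that it is not a linear map.

```python
import math


def layer_norm(x, eps=1e-5):
    """LN without learnable affine parameters (an assumption of this sketch):
    subtract the mean across neurons, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def group_layer_norm(x, num_groups, eps=1e-5):
    """Assumed form of group-based LN: split the neurons into contiguous,
    equally sized groups and normalize each group independently."""
    if len(x) % num_groups != 0:
        raise ValueError("number of neurons must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        out.extend(layer_norm(x[g * size:(g + 1) * size], eps))
    return out


def linear(weight, x):
    """Bias-free linear map: weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def ln_net(x, weights):
    """LN-Net sketch: each layer is a linear map followed by LN,
    with no other activation function."""
    for weight in weights:
        x = layer_norm(linear(weight, x))
    return x


# A linear map f satisfies f(2a) = 2 f(a). LN instead maps a and 2a to
# almost the same output, so it cannot be linear.
a = [1.0, 2.0, 3.0]
scaled_input = layer_norm([2.0 * v for v in a])
scaled_output = [2.0 * v for v in layer_norm(a)]
homogeneity_gap = max(abs(p - q) for p, q in zip(scaled_input, scaled_output))

# A width-3 LN-Net with two layers; the weights are arbitrary placeholders.
example_weights = [
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]],
    [[0.5, -1.0, 0.0], [1.0, 0.5, -0.5], [0.0, 1.0, 1.0]],
]
example_output = ln_net(a, example_weights)
```

Here `homogeneity_gap` is clearly nonzero, while every LN output sums to approximately zero because of the centering step. The width of 3 in `example_weights` mirrors the width used in the abstract's classification result, but the weights themselves are not the constructive ones from the paper.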
