On the Nonlinearity of Layer Normalization

Yunhao Ni, Yuxin Guo, Junlong Jia, Lei Huang

TL;DR

This work provides a theoretical and empirical study of the nonlinearity of Layer Normalization (LN). By introducing LN-Net, consisting of layerwise linear maps interleaved with LN, and the SSR/LSSR framework, the authors show that LN induces nonlinearity and that LN-Nets can achieve strong representation capacity, even with width as small as 3 and depth linear in the number of samples. They derive a constructive approach to classify arbitrary labelings and establish VC-dimension lower bounds, while also proposing group-based LN (LN-G) to amplify nonlinearity. Empirically, LN-G enhances capacity on random-label tasks and yields practical improvements in CNNs and Transformers, suggesting LN-G as a design principle for future architectures. The results motivate reevaluating normalization as a potential source of nonlinearity and expressive power in deep networks.

Abstract

Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains elusive. This paper investigates a new theoretical direction for LN: its nonlinearity and representation capacity. We study the representation capacity of a network built from a layerwise composition of linear and LN transformations, referred to as LN-Net. We show theoretically that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can classify them correctly. We further derive a lower bound on the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which we demonstrate theoretically under mild assumptions and support empirically with experiments. Based on these analyses, we propose designing neural architectures that exploit and amplify the nonlinearity of LN, and our experiments support the effectiveness of this approach.
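To make the LN-Net structure concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that LN here means centering and scaling without learnable scale or shift, that each "linear" layer may carry a bias, and that the group variant (LN-G) normalizes consecutive, equally sized groups of neurons. The names `layer_norm`, `ln_net`, `groups`, and `eps` are hypothetical and chosen only for illustration.

```python
import numpy as np

def layer_norm(H, groups=1, eps=1e-5):
    """Center and scale each column of H (shape d x m) within equal groups.

    groups=1 sketches plain LN (assumed: no learnable scale or shift);
    groups>1 sketches a group-wise variant in the spirit of LN-G.
    """
    d, m = H.shape
    if d % groups != 0:
        raise ValueError("width d must be divisible by groups")
    G = H.reshape(groups, d // groups, m)      # consecutive neurons share a group
    G = G - G.mean(axis=1, keepdims=True)      # centering within each group
    G = G / np.sqrt((G ** 2).mean(axis=1, keepdims=True) + eps)  # scaling
    return G.reshape(d, m)

def ln_net(X, weights, biases, groups=1):
    """Alternate affine maps and LN; samples are the columns of X."""
    H = X
    for W, b in zip(weights, biases):
        H = layer_norm(W @ H + b[:, None], groups)
    return H

# Illustrative data only: d_0 = 4 features, m = 5 samples, two layers of width 6.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
weights = [rng.normal(size=(6, 4)), rng.normal(size=(6, 6))]
biases = [rng.normal(size=6), rng.normal(size=6)]

out_ln = ln_net(X, weights, biases, groups=1)    # plain LN
out_lng = ln_net(X, weights, biases, groups=2)   # two groups of three neurons

# A linear map f would satisfy f(x1 + x2) = f(x1) + f(x2); measure the violation.
x1, x2 = X[:, :1], X[:, 1:2]
additive_gap = np.abs(
    ln_net(x1 + x2, weights, biases)
    - (ln_net(x1, weights, biases) + ln_net(x2, weights, biases))
).max()
print("max additivity violation:", additive_gap)
```

The additivity check is only an informal illustration that the composed map is not linear; it does not reproduce the paper's SSR/LSSR analysis.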

Paper Structure

This paper contains 73 sections, 33 theorems, 207 equations, 12 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

Given ${\bm{X}}_1,{\bm{X}}_2 \in \mathbb{R}^{d_0\times m}$ and a linear neural network represented as $\tilde{\varphi} = \varphi_1 \circ \cdots \circ \varphi_L$, where $\varphi_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_{l}}, (l = 1, \cdots, L)$ are all linear transformations as shown in Eqn. eqn:Li

Figures (12)

  • Figure 1: Solution to the XOR classification problem. First, we rotate the points by $45^\circ$, as shown in Figure 1(b). Then we project them vertically onto the line $y=0.5$, as shown in Figure 1(c). Next, we project them spherically onto the circle $x^2 + y^2 = 1$, as shown in Figure 1(d). Finally, we project them horizontally onto the line $x=0$, as shown in Figure 1(e). The two classes are now separated.
  • Figure 2: Obtaining ${\bm{P}}^{(l + 1)}$ from ${\bm{P}}^{(l)}$ geometrically. In Figure 2(a), ${\bm{P}}^{(l)}$ is shown as the bars on the $x$-axis. First, find the leftmost point, namely ${\bm{p}}_1^{(l)}$. Then, among the points to the right of ${\bm{p}}_1^{(l)}$ that share its label, choose the leftmost one, namely ${\bm{p}}_4^{(l)}$. Afterwards, shift all the points up by $(p_4^{(l)} - p_1^{(l)})/2$ and left by $(p_4^{(l)} + p_1^{(l)})/2$ to get ${\bm{H}}^{(l)}$, as shown in Figure 2(a). Next, spherically project ${\bm{H}}^{(l)}$ onto the unit circle to get ${\bm{X}}^{(l)}$, shown as '+'s in Figure 2(b). Finally, merge the points in ${\bm{X}}^{(l)}$ that share an ordinate, take these ordinates as the new abscissas of ${\bm{P}}^{(l+1)}$, and set the new ordinates of ${\bm{P}}^{(l+1)}$ to $0$, as shown in Figure 2(c). This yields ${\bm{P}}^{(l+1)}$.
  • Figure 3: The case of confusion in the merging process.
  • Figure 4: Results of the linear neural network and LN-Net on fitting random labels. The black dashed line represents the upper bound on the accuracy of a linear classifier. (a) Results on CIFAR-10-RL; (b) Results on MNIST-RL.
  • Figure 5: Results of LN-Net using LN-G. We vary the group number $g$ and show the training accuracy and $\mathcal{H}(f;{\mathbf{x}})$. (a) Training accuracy on CIFAR-10-RL; (b) Training accuracy on MNIST-RL; (c) $\mathcal{H}(f;{\mathbf{x}})$ on CIFAR-10-RL; (d) $\mathcal{H}(f;{\mathbf{x}})$ on MNIST-RL. The black dashed line in (a) and (b) has the same meaning as that in Figure 4.
  • ...and 7 more figures
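
The four steps in the Figure 1 caption can be traced numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it assumes the XOR inputs are the corners of the unit square, that the rotation is counterclockwise, and that the final decision is a single threshold on the remaining coordinate. The variable names (`rotated`, `projected`, `on_circle`, `scores`, `threshold`) are hypothetical.

```python
import numpy as np

# XOR inputs as columns: (0,0), (1,1) have label 0; (0,1), (1,0) have label 1.
X = np.array([[0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# Step 1: rotate by 45 degrees (counterclockwise is an assumption).
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = R @ X

# Step 2: vertical projection onto the line y = 0.5 (keep x, overwrite y).
projected = rotated.copy()
projected[1, :] = 0.5

# Step 3: spherical projection onto the circle x^2 + y^2 = 1.
on_circle = projected / np.linalg.norm(projected, axis=0, keepdims=True)

# Step 4: horizontal projection onto the line x = 0 (keep only y).
scores = on_circle[1, :]

# One threshold on the remaining coordinate now separates the two classes.
threshold = (scores[labels == 0].min() + scores[labels == 1].max()) / 2
pred = (scores < threshold).astype(int)
print(scores, pred)
```

Steps 1, 2, and 4 are affine maps; only the spherical projection in step 3 is nonlinear, and it is this step that makes the two classes separable by a single threshold.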

Theorems & Definitions (63)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Definition 3
  • Proposition 2
  • Theorem 1
  • Corollary 1
  • proof
  • Lemma 1
  • Corollary 2
  • ...and 53 more