Batch Normalization Decomposed

Ido Nachum, Marco Bondaschi, Michael Gastpar, Anatoly Khina

TL;DR

This work investigates Batch Normalization (BN) beyond the rescaling (RS) component by analyzing how recentering (RC) and the non-linearity (NL) shape the geometry of the batch representation at initialization. Using a simplified indicative model and a set of theorems, it shows that RC combined with ReLU drives the batch toward a single cluster plus one outlier that breaks away in an orthogonal direction, and it proves rank growth under random ReLU layers as well as invariant representations under RC+NL with random weights. The results formalize the observed clustering and angular behavior and provide stability guarantees for random architectures, offering insights into BN initialization and potential sparse, orthogonal representations.

Abstract

Batch normalization is a successful building block of neural network architectures. Yet, it is not well understood. A neural network layer with batch normalization comprises three components that affect the representation induced by the network: recentering the mean of the representation to zero, rescaling the variance of the representation to one, and finally applying a non-linearity. Our work follows the work of Hadi Daneshmand, Amir Joudaki, Francis Bach [NeurIPS '21], which studied deep linear neural networks with only the rescaling stage between layers at initialization. In our work, we present an analysis of the other two key components of networks with batch normalization, namely, the recentering and the non-linearity. When these two components are present, we observe a curious behavior at initialization. Through the layers, the representation of the batch converges to a single cluster except for an odd data point that breaks far away from the cluster in an orthogonal direction. We shed light on this behavior from two perspectives: (1) we analyze the geometrical evolution of a simplified indicative model; (2) we prove a stability result for the aforementioned configuration.
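
The three components named in the abstract can be written out one by one. The following is a minimal NumPy sketch and not the authors' code: the function names, the batch layout (data points as columns, as in Lemma 2), the Gaussian weight scaling, and the `eps` constant are all assumptions made for illustration.

```python
import numpy as np

def recenter(H):
    """RC: subtract each feature's mean across the batch (columns)."""
    return H - H.mean(axis=1, keepdims=True)

def rescale(H, eps=1e-8):
    """RS: divide each feature by its standard deviation across the batch.
    eps is an assumed numerical safeguard, not part of the paper."""
    return H / np.sqrt(H.var(axis=1, keepdims=True) + eps)

def relu(H):
    """NL: the ReLU non-linearity."""
    return np.maximum(H, 0.0)

def bn_layer(X, W):
    """Linear map followed by RC, RS, and NL, in that order.
    X has shape (k, n): n data points stored as columns (assumed layout)."""
    return relu(rescale(recenter(W @ X)))

rng = np.random.default_rng(0)
k, n = 16, 8                                   # hypothetical width and batch size
X = rng.standard_normal((k, n))                # random input batch
W = rng.standard_normal((k, k)) / np.sqrt(k)   # assumed Gaussian initialization
H = W @ X                                      # pre-activations
Y = bn_layer(X, W)                             # representation after one layer
```

Keeping the three steps as separate functions makes it possible to switch one of them off, for example dropping `recenter` to obtain a rescaling-only layer.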

Paper Structure

This paper contains 6 sections, 8 theorems, 84 equations, 8 figures.

Key Result

Lemma 2

Let $\bm{x}_1, \bm{x}_2, \ldots, \bm{x}_{t+1} \in \mathbb{R}^k$ be column vectors such that no two vectors are collinear. Denote $X^{(\ell)} \coloneqq \left( \bm{x}_1 \middle| \bm{x}_2 \middle| \cdots \middle| \bm{x}_\ell \right)$ for $\ell \in [t+1]$. Denote further $W^{(r)} \in \mathbb{R}^{r \times k}$ [...] where $W^{(r+1)} =$ [...] for $\mathsf{w}_{r+1} \sim \mathcal{N} \left( 0, I_k \right)$.

Figures (8)

  • Figure 1: A comparison between previous work and our contribution. Our contribution studies the effects of the ReLU non-linearity and recentering at initialization and how they interact.
  • Figure 2: Comparison of final training accuracy and the rank of the last hidden layer in a fully-connected ReLU network, using the supplementary code of [bach20]: (1) with BN; (2) without BN; (3) without BN, changing only the default PyTorch initialization in the code to He initialization.
  • Figure 3: The batch representation induced by the final hidden layer with RC and ReLU NL. Figure (a) is a random two-dimensional projection of the final layer's representations. The "escaped" point is marked in red. Figure (b) represents the angles between pairs of vector representations before and after the final layer. The points marked in red represent the angles between the "escaped" point and any other point of the batch.
  • Figure 4: Partial example of the first three layers of the tree generated by the process of positive/negative transformations analyzed in this paper, starting from a one-dimensional batch with $n=5$ elements. Different elements have different shapes, to make it easier to follow their change of position. The average of the vector at each step is denoted by $\bar{x}$.
  • Figure 5: Effect of recentering on the angles between pairs of data points before and after the 30th layer of the neural network. If Batch Normalization without recentering is used, the output angles are approximately $60^{\circ}$. With recentering, the angles increase to approximately $75^{\circ}$.
  • ...and 3 more figures
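
The angle plots described in Figures 3 and 5 rest on the pairwise angles between the columns of the batch representation. Below is a hedged sketch of how such angles could be measured at initialization. The depth, width, batch size, seed, weight scaling, and the choice to rescale by the per-feature standard deviation are assumptions, and the snippet is not meant to reproduce the numbers reported in the figures.

```python
import numpy as np

def pairwise_angles_deg(X):
    """Angles in degrees between all distinct pairs of columns of X."""
    norms = np.maximum(np.linalg.norm(X, axis=0), 1e-12)  # guard against zero columns
    cos = np.clip((X.T @ X) / np.outer(norms, norms), -1.0, 1.0)
    upper = np.triu_indices(X.shape[1], k=1)              # each pair counted once
    return np.degrees(np.arccos(cos[upper]))

def forward(X, depth, width, rng, use_rc=True, eps=1e-8):
    """Pass a batch (columns = data points) through random layers at initialization."""
    for _ in range(depth):
        fan_in = X.shape[0]
        W = rng.standard_normal((width, fan_in)) / np.sqrt(fan_in)  # assumed init
        H = W @ X
        if use_rc:                                        # RC: zero mean per feature
            H = H - H.mean(axis=1, keepdims=True)
        H = H / np.sqrt(H.var(axis=1, keepdims=True) + eps)  # RS (assumed form)
        X = np.maximum(H, 0.0)                            # NL: ReLU
    return X

rng = np.random.default_rng(0)
X0 = rng.standard_normal((64, 10))                        # hypothetical batch of 10 points
angles_rc = pairwise_angles_deg(forward(X0, depth=30, width=64, rng=rng))
print(np.round(np.sort(angles_rc), 1))
```

Setting `use_rc=False` gives the rescaling-only variant for comparison. Inspecting the sorted angles is one way to look for a point whose angles to the rest of the batch differ from the others.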

Theorems & Definitions (16)

  • Definition 1
  • Example
  • Lemma 2
  • Theorem 3
  • Corollary 4
  • Remark
  • Definition 5: positive and negative transformations
  • Definition 6
  • Lemma 7
  • Definition 8: stable
  • ...and 6 more