Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

Hancheng Min, Enrique Mallada, René Vidal

TL;DR

The paper addresses how gradient flow trains a two-layer ReLU network for binary classification when initialization is small and data are well-separated. It introduces a finite-$\epsilon$ analysis that identifies an early alignment phase where first-layer neurons align with data centers in cones $\mathcal{S}_+$ or $\mathcal{S}_-$, yielding a rigorous bound $t_1=O\left(\frac{\log n}{\sqrt{\mu}}\right)$. After alignment, training effectively decouples into two linear subnetworks, leading to $O\left(\frac{1}{t}\right)$ loss decay and a near-low-rank first-layer weight matrix, with a stable rank bound of at most $2$. The results are complemented by MNIST experiments that illustrate the predicted alignment and convergence dynamics, and the analysis clarifies the role of data separation $\mu$ and initialization scale in the training behavior and implicit bias of the model.
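The stable rank mentioned above can be computed directly from a weight matrix. The following is a minimal sketch, assuming the standard definition $\|W\|_F^2/\|W\|_2^2$; the function name `stable_rank` and the example matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm."""
    fro_sq = np.sum(W ** 2)
    spec_sq = np.linalg.norm(W, 2) ** 2  # ord=2 gives the largest singular value
    return fro_sq / spec_sq

# Hypothetical first-layer weights: each row (neuron) points along one of
# two directions, mimicking neurons that have split into two aligned groups.
u_plus = np.array([1.0, 0.0, 0.0])
u_minus = np.array([0.0, 1.0, 0.0])
W_example = np.vstack([2.0 * u_plus, 1.0 * u_plus, 3.0 * u_minus])

print(stable_rank(W_example))
```

A matrix whose rows concentrate on two directions has stable rank at most 2, whereas an identity matrix of size $k$ has stable rank $k$.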

Abstract

This paper studies the problem of training a two-layer ReLU network for binary classification using gradient flow with small initialization. We consider a training dataset with well-separated input vectors: Any pair of input data with the same label are positively correlated, and any pair with different labels are negatively correlated. Our analysis shows that, during the early phase of training, neurons in the first layer try to align with either the positive data or the negative data, depending on its corresponding weight on the second layer. A careful analysis of the neurons' directional dynamics allows us to provide an $\mathcal{O}(\frac{\log n}{\sqrt{\mu}})$ upper bound on the time it takes for all neurons to achieve good alignment with the input data, where $n$ is the number of data points and $\mu$ measures how well the data are separated. After the early alignment phase, the loss converges to zero at a $\mathcal{O}(\frac{1}{t})$ rate, and the weight matrix on the first layer is approximately low-rank. Numerical experiments on the MNIST dataset illustrate our theoretical findings.
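The setting described in the abstract can be explored on a toy problem. The sketch below is not the authors' experiment: it replaces gradient flow with plain gradient descent at a small step size, uses the logistic loss, and assumes a balanced initialization $|v_j|=\|w_j\|$. The synthetic data, all sizes, and the helper names (`loss_and_grads`, `mean_alignment`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 40, 5, 50                 # data points, input dim, hidden width (illustrative)
eps, lr, steps = 1e-3, 0.1, 2000    # init scale, step size, iterations (illustrative)

# Toy data: x_i = y_i * (e_1 + noise), noise orthogonal to e_1 with norm <= 0.5,
# so same-label pairs are positively correlated and opposite-label pairs negatively.
y = np.where(np.arange(n) < n // 2, 1.0, -1.0)
noise = rng.uniform(-1.0, 1.0, size=(n, d))
noise[:, 0] = 0.0
noise *= 0.5 / np.maximum(np.linalg.norm(noise, axis=1, keepdims=True), 0.5)
X = y[:, None] * (np.eye(d)[0] + noise)

# Separation measure: smallest cosine between any two label-signed data points.
Z = X * y[:, None]
Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
mu = (Zn @ Zn.T).min()

# Small, balanced initialization (assumed form): |v_j| = ||w_j||, random signs.
W = eps * rng.standard_normal((h, d))
v = rng.choice([-1.0, 1.0], size=h) * np.linalg.norm(W, axis=1)

def loss_and_grads(W, v, X, y):
    """Mean logistic loss of f(x) = sum_j v_j relu(<w_j, x>) and its gradients."""
    pre = X @ W.T                                # (n, h) pre-activations
    act = np.maximum(pre, 0.0)
    f = act @ v
    loss = np.mean(np.logaddexp(0.0, -y * f))
    g = -y / (1.0 + np.exp(y * f)) / len(y)      # dL/df_i
    grad_v = act.T @ g
    grad_W = ((pre > 0) * g[:, None] * v[None, :]).T @ X
    return loss, grad_W, grad_v

def mean_alignment(W, v, X, y):
    """Mean cosine between each neuron and the summed data matching sign(v_j)."""
    x_plus, x_minus = X[y > 0].sum(axis=0), X[y < 0].sum(axis=0)
    target = np.where(v[:, None] > 0, x_plus, x_minus)
    num = np.sum(W * target, axis=1)
    den = np.linalg.norm(W, axis=1) * np.linalg.norm(target, axis=1)
    return np.mean(num / den)

align_start = mean_alignment(W, v, X, y)
losses = []
for _ in range(steps):
    loss, gW, gv = loss_and_grads(W, v, X, y)
    losses.append(loss)
    W -= lr * gW
    v -= lr * gv
align_end = mean_alignment(W, v, X, y)

print(f"mu = {mu:.3f}")
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(f"mean alignment: {align_start:.3f} -> {align_end:.3f}")
```

Comparing the printed alignment before and after training gives a rough view of the alignment behavior on this toy data. Neurons that end up activating no data point stop moving, so the mean need not reach 1.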

Paper Structure (54 sections, 17 theorems, 127 equations, 17 figures)

Key Result

Theorem 1

Given some initialization from eq_init, if $\epsilon=\mathcal{O}( \frac{1}{\sqrt{h}}\exp( -\frac{n}{\sqrt{\mu}}\log n))$, then for any regular solution to the gradient flow dynamics eq_gf, we have

Figures (17)

  • Figure 1: Illustration of $\frac{d}{dt}\frac{w_j}{\|w_j\|}$ during the early alignment phase. $x_1$ has label $+1$, and $x_2,x_3$ have label $-1$. Because $x_1,x_2$ lie inside the halfspace $\left\langle x,w_j\right\rangle>0$ (gray shaded), $x_a(w_j)=x_1-x_2$; since $\mathrm{sign}(v_j(0))>0$, GF pushes $w_j$ towards $x_a(w_j)$.
  • Figure 2: Neuron alignment under Assumption \ref{assump_data}. For a neuron in $\mathcal{V}_+$: ① if it lies inside $\mathcal{S}_-$, it is repelled by $x_-$ and escapes $\mathcal{S}_-$; once outside $\mathcal{S}_-$, it may ② be repelled by some negative data and eventually enter $\mathcal{S}_\text{dead}$, or ③ gain some activation on positive data and eventually enter $\mathcal{S}_+$, where it is constantly attracted by $x_+$.
  • Figure 3: For $j\in\mathcal{V}_+$, Assumption \ref{assump_data} enforces $\left\langle x_iy_i,x_a(w_j)\right\rangle>0$, thus GF pushes $w_j$ into the halfspace $\left\langle x_iy_i,w_j\right\rangle>0$ at $\left\langle x_i,w_j\right\rangle=0$ (i.e., towards gaining activation on $x_i$ if $y_i=+1$, or losing activation on $x_i$ if $y_i=-1$). $\mathcal{S}_{x_i}^\perp$ and $\mathcal{S}_{w_j}^\perp$ denote the subspaces orthogonal to $x_i$ and $w_j$, respectively.
  • Figure 4: Illustration of the activation pattern evolution. The epochs on the time axis denote the times at which $w_j$ changes its activation pattern by either losing one negative data point (denoted by "$+$") or gaining one positive data point (denoted by "$-$"). A marker is colored if the corresponding data point currently activates $w_j$. During the alignment phase $0\leq t\leq t_1$, a neuron $w_j, j\in\mathcal{V}_+$ starts with activation on all negative data and no positive data; every $\mathcal{O}\left( 1/n_a\right)$ time, it must change its activation pattern, unless either ① it reaches $\mathcal{S}_\text{dead}$, or ② it activates some positive data at some epoch and then eventually reaches $\mathcal{S}_+$.
  • Figure 5: Training a two-layer ReLU network under small initialization for binary classification on MNIST digits $0$ and $1$. (First Plot) Data correlation $[\left\langle x_i,x_j\right\rangle]_{ij}$ as a heatmap, where the data are reordered by their label (digit 1 first, then digit 0); (Second Plot) Alignment between neurons and the aggregate positive/negative data $x_+=\sum_{i\in\mathcal{I}_+}x_i$, $x_-=\sum_{i\in\mathcal{I}_-}x_i$; (Third Plot) The loss $\mathcal{L}$, the stable rank, and the squared spectral norm of $W$ during training; (Fourth Plot) Visualization of the neuron centers $\bar{w}_+,\bar{w}_-$ and data centers $\bar{x}_+,\bar{x}_-$ (at iteration $15000$).
  • ...and 12 more figures
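
The quantity $x_a(w_j)$ from the Figure 1 caption can be sketched in code. The example below assumes, based on that caption, that $x_a(w)$ is the label-signed sum of the data points that currently activate the neuron. It further assumes that the direction of $w$ moves along $\mathrm{sign}(v)$ times the component of $x_a(w)$ orthogonal to $w$. The function names and coordinates are hypothetical.

```python
import numpy as np

def x_a(w, X, y):
    """Label-signed sum of the data points with <x_i, w> > 0."""
    active = X @ w > 0
    return (y[active, None] * X[active]).sum(axis=0)

def direction_velocity(w, v_sign, X, y):
    """Assumed early-phase velocity of w/||w||: sign(v) * (x_a projected off w)."""
    u = w / np.linalg.norm(w)
    xa = x_a(w, X, y)
    return v_sign * (xa - (xa @ u) * u)

# Made-up coordinates mirroring Figure 1: x_1 has label +1, x_2 and x_3 have
# label -1, and only x_1 and x_2 lie in the halfspace <x, w> > 0.
X_fig = np.array([[1.0, 0.5],
                  [-1.0, 0.5],
                  [-1.0, -0.5]])
y_fig = np.array([1.0, -1.0, -1.0])
w_fig = np.array([0.0, 1.0])

print(x_a(w_fig, X_fig, y_fig))
print(direction_velocity(w_fig, 1.0, X_fig, y_fig))
```

With these coordinates, `x_a` equals $x_1-x_2$ as in the caption, and the returned velocity is orthogonal to $w$, as expected for the derivative of a unit vector.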

Theorems & Definitions (42)

  • Remark 1
  • Remark 2
  • Remark 3
  • Definition 1
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • proof
  • Remark 4
  • Lemma 1
  • ...and 32 more