Initialization Matters: On the Benign Overfitting of Two-Layer ReLU CNN with Fully Trainable Layers

Shuning Shang, Xuran Meng, Yuan Cao, Difan Zou

TL;DR

This work extends the analysis of benign overfitting to two-layer ReLU convolutional neural networks with fully trainable layers, and identifies sharp conditions on the initialization scaling and signal-to-noise ratio (SNR) under which benign overfitting does or does not occur.

Abstract

Benign overfitting refers to the phenomenon in which over-parameterized neural networks fit the training data perfectly yet still generalize well to unseen data. While this has been widely investigated theoretically, existing works are limited to two-layer networks with fixed output layers, where only the hidden weights are trained. We extend the analysis to two-layer ReLU convolutional neural networks (CNNs) with fully trainable layers, which is closer to practice. Our results show that the initialization scaling of the output layer is crucial to the training dynamics. Large scales make training behave similarly to the fixed-output case: the hidden layer grows rapidly while the output layer remains largely unchanged. In contrast, small scales result in more complex layer interactions: the hidden layer initially grows to a specific ratio relative to the output layer, after which both layers grow jointly and maintain that ratio throughout training. Furthermore, in both settings, we provide nearly matching upper and lower bounds on the test error, identifying sharp conditions on the initialization scaling and signal-to-noise ratio (SNR) under which benign overfitting does or does not occur. Numerical experiments back up the theoretical results.

Paper Structure

This paper contains 38 sections, 43 theorems, 278 equations, 3 figures, 1 table.

Key Result

Theorem 1.1

There exist a threshold $\widetilde{v} = \mathrm{poly}(d, n, m, \sigma_p)$ for the initialization scale $v_0$ of the output layer and a $T^* = \mathrm{poly}(d, n, m, \sigma_p, \eta, v_0, \epsilon)$ such that the training loss converges to $\epsilon$. Meanwhile,

Figures (3)

  • Figure 1: Illustration of the phase transition between small and large initialization of the output layer. The $x$-axis represents $v_0$, while the $y$-axis corresponds to $\|\bm{\mu}\|_2^{-1}$.
  • Figure 2: Figure \ref{fig: exp1_1} is the truncated heatmap of test error on synthetic data under different $v_0$, where accuracy above 0.95 is colored blue and all other values are colored yellow. The shape of the contour aligns with our theoretical prediction in Figure 1, indicating a phase transition between small $v_0$ and large $v_0$ as predicted by Theorems \ref{thm: single_phase} and \ref{thm: double_phase}. Figure \ref{fig: exp1_2} is the truncated heatmap of test error under a large fixed $v_0=5$ with varying $\|\bm{\mu}\|_2$ and $d$, where test accuracy above 0.8 is colored blue and all other values are colored yellow. The contours of the test accuracy are straight lines in the space $(\sigma_p^4 d, n \|\bm{\mu}\|_2^4)$, which validates Theorem \ref{thm: single_phase}. Figure \ref{fig: exp1_3} is the truncated heatmap of test error under different $\|\bm{\mu}\|_2$ and $v_0$, where accuracy above 0.8 is colored blue and all other values are colored yellow. The contours of the test accuracy are straight lines in the space $(v_0, \|\bm{\mu}\|_2^{-1})$, which validates Theorem \ref{thm: double_phase}.
  • Figure 3: Scales of the hidden layer ($\max_{i \in [n]} \langle\mathbf{w}_{y_i,r}^{(t)}, \bm{\xi}_i\rangle$, blue) and output layer ($\max_{i \in [n]} v^{(t)}_{y_i,r,2}$, green) throughout the training process. These are referenced by the y-axis on the left. The red curve represents the ratio $\langle\mathbf{w}_{y_i,r}^{(t)}, \bm{\xi}_i\rangle / v^{(t)}_{y_i,r,2}$, referenced by the y-axis on the right. When $v_0$ is large (Figure \ref{fig: exp4_0}), the ratio of the two layers keeps growing, indicating an imbalance. When $v_0$ is smaller (Figures \ref{fig: exp4_1} and \ref{fig: exp4_2}), the scales of the two layers grow consistently over time, and their ratio converges to a constant value, validating the "balanced" results predicted by our theory.
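
The quantities tracked in Figure 3 can be illustrated with a small NumPy sketch. This is not the authors' code. It assumes a two-patch data model (one signal patch $y\bm{\mu}$, one Gaussian noise patch $\bm{\xi}$ orthogonal to $\bm{\mu}$), a network of the form $f(\mathbf{x}) = \sum_{j=\pm 1} j \sum_{r}\sum_{p} v_{j,r,p}\,\mathrm{ReLU}(\langle\mathbf{w}_{j,r}, \mathbf{x}_p\rangle)$, logistic loss, and full-batch gradient descent. The function names, the hidden-layer initialization scale `sigma_0`, and all default hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def make_data(n, d, mu_norm, sigma_p, rng):
    """Assumed data model: patch 1 is y * mu, patch 2 is Gaussian noise."""
    mu = np.zeros(d)
    mu[0] = mu_norm
    y = rng.choice([-1.0, 1.0], size=n)
    signal = y[:, None] * mu[None, :]                 # (n, d)
    noise = sigma_p * rng.standard_normal((n, d))     # (n, d)
    noise[:, 0] = 0.0                                 # keep noise orthogonal to mu
    return signal, noise, y

def forward(W, V, signal, noise):
    """W: (2, m, d) filters, V: (2, m, 2) output weights; index 0 is class +1."""
    a_sig = np.maximum(np.einsum('jmd,nd->njm', W, signal), 0.0)
    a_noi = np.maximum(np.einsum('jmd,nd->njm', W, noise), 0.0)
    F = (V[None, :, :, 0] * a_sig + V[None, :, :, 1] * a_noi).sum(axis=2)
    return F[:, 0] - F[:, 1], a_sig, a_noi

def grads(W, V, signal, noise, y):
    """Gradients of the mean logistic loss with respect to both layers."""
    f, a_sig, a_noi = forward(W, V, signal, noise)
    g = -y / (1.0 + np.exp(y * f)) / len(y)           # dLoss/df_i
    s = np.array([1.0, -1.0])                         # sign of each class branch
    gs = g[:, None, None] * s[None, :, None]          # (n, 2, 1)
    dV = np.stack([(gs * a_sig).sum(0), (gs * a_noi).sum(0)], axis=-1)
    c_sig = gs * V[None, :, :, 0] * (a_sig > 0)       # ReLU derivative mask
    c_noi = gs * V[None, :, :, 1] * (a_noi > 0)
    dW = (np.einsum('njm,nd->jmd', c_sig, signal)
          + np.einsum('njm,nd->jmd', c_noi, noise))
    return dW, dV

def train(v0, n=20, d=200, m=10, mu_norm=2.0, sigma_p=1.0,
          sigma_0=0.01, eta=0.01, steps=500, seed=0):
    """Train both layers; return per-step (loss, hidden scale, output scale)."""
    rng = np.random.default_rng(seed)
    signal, noise, y = make_data(n, d, mu_norm, sigma_p, rng)
    W = sigma_0 * rng.standard_normal((2, m, d))
    V = np.full((2, m, 2), float(v0))                 # output layer starts at v0
    j = (y < 0).astype(int)                           # branch index matching y_i
    history = []
    for _ in range(steps):
        dW, dV = grads(W, V, signal, noise, y)
        W -= eta * dW
        V -= eta * dV
        f, _, _ = forward(W, V, signal, noise)
        loss = np.logaddexp(0.0, -y * f).mean()
        hidden = np.einsum('nmd,nd->nm', W[j], noise).max()   # max <w_{y_i,r}, xi_i>
        output = V[j][:, :, 1].max()                          # max v_{y_i,r,2}
        history.append((loss, hidden, output))
    return np.array(history)
```

Calling `train(v0=5.0)` and `train(v0=0.05)` and plotting column 1 divided by column 2 gives a ratio curve analogous to the red curve in Figure 3. Whether a toy run like this reproduces the reported balanced or imbalanced behavior depends on the assumed hyperparameters.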

Theorems & Definitions (46)

  • Theorem 1.1: Informal
  • Definition 2.1: Data Model
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 4.1: Dynamic of two intertwined sequences
  • Proposition 4.2
  • Definition 4.3
  • Proposition 4.4
  • Lemma 4.5
  • Lemma 4.6: Noise Memorization becomes balancing
  • ...and 36 more