Disentangle Sample Size and Initialization Effect on Perfect Generalization for Single-Neuron Target

Jiajie Zhao, Zhiwei Bai, Yaoyu Zhang

TL;DR

This paper investigates how initialization scale and data sample size influence perfect generalization when recovering a single-neuron target with a two-layer network. It introduces the initial imbalance ratio and two sample-size thresholds—optimistic and separation—and provides both empirical and theoretical analyses of how these factors shape training dynamics under gradient flow. A key finding is that, under small initialization, the trajectory and convergence point are determined by the normalized initial imbalance vector $\frac{\mathbf{C}(\bm{\theta}_0)}{\|\mathbf{C}(\bm{\theta}_0)\|_2}$, with recovery behavior exhibiting phase transitions at the identified sample-size thresholds. The results extend to multi-neuron networks, where only a subset of neurons remains active, offering insights into generalization in overparameterized models. Overall, the work clarifies how initialization and data availability interact to yield perfect generalization in a simplified setting, providing a stepping stone toward understanding more complex target functions.

Abstract

Overparameterized models like deep neural networks have the intriguing ability to recover target functions with fewer sampled data points than parameters (see arXiv:2307.08921). To gain insights into this phenomenon, we concentrate on a single-neuron target recovery scenario, offering a systematic examination of how initialization and sample size influence the performance of two-layer neural networks. Our experiments reveal that a smaller initialization scale is associated with improved generalization, and we identify a critical quantity called the "initial imbalance ratio" that governs training dynamics and generalization under small initialization, supported by theoretical proofs. Additionally, we empirically delineate two critical thresholds in sample size--termed the "optimistic sample size" and the "separation sample size"--that align with the theoretical frameworks established in prior work (see arXiv:2307.08921 and arXiv:2309.00508). Our results indicate a transition in the model's ability to recover the target function: below the optimistic sample size, recovery is unattainable; at the optimistic sample size, recovery becomes attainable, but only from a set of initializations of measure zero. Upon reaching the separation sample size, the set of initializations that can successfully recover the target function shifts from zero to positive measure. These insights, derived from a simplified context, provide a perspective on the intricate yet decipherable complexities of perfect generalization in overparameterized neural networks.

Paper Structure

This paper contains 19 sections, 9 theorems, 47 equations, 11 figures, and 1 table.

Key Result

Theorem 1

Consider the gradient flow governed by the differential equation $\frac{\mathrm{d}\bm{\theta}}{\mathrm{d}t} = -\nabla \ell(\bm{\theta})$, where $\ell(\bm{\theta}) = \frac{1}{2}\sum_{i=1}^n (f_{\bm{\theta}}(\bm{x}_i) - y_i)^2$ for $(\bm{x}_i,y_i) \in \mathbb{R}^d \times \mathbb{R}$, with the model $f_{\bm{\theta}}(\bm{x}) = \sum_{k=1}^m a_k\sigma(\bm{w}_k^\top \bm{x})$, and the parameter vector $\bm{\theta} = (a_1, \bm{w}_1, \ldots, a_m, \bm{w}_m)$ [...] Under these assumptions, the following [...]
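
To make the setting of Theorem 1 concrete, the following is a minimal NumPy sketch of the model $f_{\bm{\theta}}$, the loss $\ell$, and an explicit Euler discretization of gradient flow on $\ell$. It is an illustration under stated assumptions, not the authors' code: the activation is assumed to be $\tanh$, and the function names, step size, and step count are hypothetical choices made here.

```python
import numpy as np

def model(a, W, X):
    """f_theta(x) = sum_k a_k * sigma(w_k^T x), evaluated on each row of X.

    a: (m,) output weights, W: (m, d) input weights, X: (n, d) inputs.
    sigma = tanh is an assumption of this sketch.
    """
    return np.tanh(X @ W.T) @ a

def loss(a, W, X, y):
    """Squared loss l(theta) = 1/2 * sum_i (f_theta(x_i) - y_i)^2."""
    r = model(a, W, X) - y
    return 0.5 * np.sum(r ** 2)

def grad(a, W, X, y):
    """Analytic gradient of the loss with respect to (a, W) for sigma = tanh."""
    H = np.tanh(X @ W.T)          # (n, m) hidden activations
    r = H @ a - y                 # (n,) residuals
    grad_a = H.T @ r              # dl/da_k = sum_i r_i * sigma(z_ik)
    dH = 1.0 - H ** 2             # tanh'(z) = 1 - tanh(z)^2
    # dl/dw_k = sum_i r_i * a_k * sigma'(z_ik) * x_i
    grad_W = (r[:, None] * dH * a[None, :]).T @ X
    return grad_a, grad_W

def euler_gradient_flow(a, W, X, y, step=1e-2, n_steps=1000):
    """Approximate d(theta)/dt = -grad l(theta) with explicit Euler steps."""
    a, W = a.copy(), W.copy()
    for _ in range(n_steps):
        ga, gW = grad(a, W, X, y)
        a -= step * ga
        W -= step * gW
    return a, W
```

Explicit Euler with a small step is simply gradient descent; it tracks the continuous flow only approximately, so behavior that depends on very small initialization scales would need a correspondingly careful choice of step size and precision.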

Figures (11)

  • Figure 1: The network and target function correspond to Example 1. Here, $n$ represents the sample size. For Figures \ref{n=2} through \ref{n=6}, samples were evenly spaced on the interval $[-2, 2]$. In Figure \ref{scale2}, the dataset $\{(x_i,y_i)\}_{i=1}^n$ satisfies $y_i=f^*(x_i)$, with the $\{x_i\}_{i=1}^n$ drawn independently from a standard Gaussian distribution. For each combination of initialization scale and sample size, we conducted $50$ trials with different seeds to generate data points and parameter initializations. The reported generalization error is the average over these trials. Curve legends indicate the initialization scale.
  • Figure 2: The network and target function correspond to Example 1. We trained the network across five trials, each using $6$ evenly spaced data points within the interval $[-2,2]$. Distinct initialization seeds and scales were used for each trial, but by scaling the initial parameters of the second neuron, we keep $c=0.5$ across all trials. To align the curves, we applied translations based on distances calculated by Theorem 1.
  • Figure 3: The network and target function correspond to Example 1 with a sample size of $6$ and an initialization scale of $10^{-8}$. We used $400$ random seeds to initialize the parameters. Figure \ref{plot Q n=6} shows the convergence points and the structures of $Q^1$ and $Q^2$, along with the origin and two exemplary training trajectories. The dashed line is $Q^1$ and the affine surface is $Q^2$. Figure \ref{c and Q1 Q2} presents the convergence results using seeds $0$-$400$, where blue and orange represent convergence to $Q^1$ and $Q^2$, respectively. The x-axis denotes the seed index, and the y-axis measures the absolute value of the ratio $C_{1}/C_{2}$. The two black horizontal lines mark the ratios at $y=1.35$ and $y=0.74$.
  • Figure 4: The network $f_{\bm{\theta}}(x)$ and target function $f^*$ correspond to Example 1. The samples $\{(x_i,y_i)\}_{i=1}^n$, where $y_i=f^*(x_i)$, are obtained by drawing $\{x_i\}_{i=1}^n$ from a standard Gaussian distribution. Five random seeds were used to generate the samples. Generalization errors below $10^{-8}$ are considered successful recovery and are recorded as $10^{-8}$. Figure \ref{c and genloss, recover at Q^2} depicts the convergence point for $n=4$ with samples generated by seed $3$. The dashed line is $Q^1$ and the affine surface is $Q^2$. All experiments were initialized with a scale of $10^{-20}$.
  • Figure 5: A two-layer neural network of width $1000$ with activation function $\sigma(x)=\frac{x}{1+x^2}$ is trained on $6$ evenly spaced data points in the interval $[-2,2]$ with labels given by $y=\tanh(x+1)$. Four trials with varying initialization seeds and scales were conducted. The ratio of initial parameters $C_{i}/C_{1}$ is set to $1.5+0.0015(i-1)$ for each neuron $i=1,2,\ldots,1000$ in all trials. For visualization, curves in Figure \ref{multi neuron, c and loss} are translated based on distances derived from Theorem 1. Figure \ref{multi neuron, c and para} shows the parameter trajectories for the first two neurons.
  • ...and 6 more figures
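
The data setups described in the captions for Figures 1 and 5 can be sketched roughly as follows. This is an illustrative reading of the captions, not the authors' code: the helper names are hypothetical, and it is an assumption that "evenly spaced" points include both endpoints of $[-2,2]$.

```python
import numpy as np

def sigma(x):
    """Activation quoted in the Figure 5 caption: sigma(x) = x / (1 + x^2)."""
    return x / (1.0 + x ** 2)

def evenly_spaced_inputs(n, lo=-2.0, hi=2.0):
    """n evenly spaced points on [lo, hi]; including endpoints is an assumption."""
    return np.linspace(lo, hi, n)

def gaussian_inputs(n, seed):
    """n i.i.d. standard Gaussian inputs, reproducible for a given seed."""
    return np.random.default_rng(seed).standard_normal(n)

# Figure 5 style dataset: 6 evenly spaced points with labels y = tanh(x + 1).
x = evenly_spaced_inputs(6)
y = np.tanh(x + 1.0)
```

For the Gaussian case in Figure 1, one call to `gaussian_inputs(n, seed)` per trial with a distinct seed would mirror the caption's description of averaging over trials with different seeds.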

Theorems & Definitions (21)

  • Example 1
  • Definition 1: Separation of $Q^k$
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof: Formal Proof of Theorem 3
  • Corollary 1
  • proof
  • Corollary 2
  • proof
  • ...and 11 more