Asymptotic Analysis of Two-Layer Neural Networks after One Gradient Step under Gaussian Mixtures Data with Structure

Samet Demir, Zafer Dogan

TL;DR

The work investigates two-layer neural networks trained with a single gradient step on the first layer under structured Gaussian mixture data in a proportional asymptotic regime. By decomposing the gradient into spike and bulk components and analyzing a structure-plus-bulk interaction in the feature space, the authors establish a conditional Gaussian equivalence and further show that a finite-degree Hermite polynomial activation can replicate the training and generalization performance of the nonlinear network. This Hermite-model equivalence provides a tractable framework to study feature learning under mixtures, linking data-spread and learning-rate scalings via the strength parameter $\beta$ and weighting parameter $\alpha$. The results are validated through extensive simulations, including Fashion-MNIST experiments with GAN-generated data, demonstrating the practical relevance of the theory for realistic structured data. Overall, the paper advances understanding of how data structure and one-step feature learning govern generalization in two-layer networks, with implications for activation-function design and analysis under non-iid, high-dimensional mixtures.

Abstract

In this work, we study the training and generalization performance of two-layer neural networks (NNs) after one gradient descent step under structured data modeled by Gaussian mixtures. While previous research has extensively analyzed this model under an isotropic data assumption, such simplifications overlook the complexities inherent in real-world datasets. Our work addresses this limitation by analyzing two-layer NNs under a Gaussian mixture data assumption in the asymptotically proportional limit, where the input dimension, number of hidden neurons, and sample size grow with finite ratios. We characterize the training and generalization errors by leveraging recent advancements in Gaussian universality. Specifically, we prove that a high-order polynomial model performs equivalently to the nonlinear neural network under certain conditions. The degree of the equivalent model is intricately linked to both the "data spread" and the learning rate employed during the one gradient step. Through extensive simulations, we demonstrate the equivalence between the original model and its polynomial counterpart across various regression and classification tasks. Additionally, we explore how different properties of Gaussian mixtures affect learning outcomes. Finally, we illustrate experimental results on Fashion-MNIST classification, indicating that our findings can translate to realistic data.
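
To make the setting in the abstract concrete, the following NumPy sketch runs one gradient step on the first layer and then fits the second layer by ridge regression. It is a minimal illustration under stated assumptions, not the authors' experimental code: the covariance form $I + \theta\,\gamma\gamma^T$ per component, the initialization scales, the choice of teacher direction `xi`, the reuse of the same samples for the gradient step and the ridge fit, and all function names are hypothetical choices made here for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_mixture(rng, m, thetas, gammas, rhos):
    """Zero-mean mixture; component c has covariance I + thetas[c] * g_c g_c^T.
    This rank-one spiked form is an assumption made for illustration."""
    labels = rng.choice(len(rhos), size=m, p=rhos)
    bulk = rng.standard_normal((m, gammas.shape[1]))
    spike = np.sqrt(thetas[labels]) * rng.standard_normal(m)
    return bulk + spike[:, None] * gammas[labels]

def network(F, a, X):
    """Two-layer network f(x) = a^T relu(F x) / sqrt(k)."""
    return relu(X @ F.T) @ a / np.sqrt(F.shape[0])

def squared_loss(F, a, X, y):
    return 0.5 * np.mean((y - network(F, a, X)) ** 2)

def negative_gradient(F, a, X, y):
    """Negative gradient of squared_loss with respect to the first layer F."""
    m, k = X.shape[0], F.shape[0]
    pre = X @ F.T                                   # (m, k) pre-activations
    resid = y - relu(pre) @ a / np.sqrt(k)          # (m,) residuals
    weighted = resid[:, None] * (pre > 0) * a[None, :]
    return weighted.T @ X / (m * np.sqrt(k))        # (k, n)

def ridge_second_layer(features, y, lam):
    """Minimize mean squared error plus lam * ||a||^2 over the second layer."""
    m, k = features.shape
    gram = features.T @ features / m + lam * np.eye(k)
    return np.linalg.solve(gram, features.T @ y / m)

def run_demo(n=100, m=200, k=100, theta=10.0, eta=1.0, lam=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    gammas = rng.standard_normal((2, n))
    gammas /= np.linalg.norm(gammas, axis=1, keepdims=True)
    thetas = np.array([theta, theta])
    rhos = np.array([0.5, 0.5])
    # Hypothetical teacher direction aligned with the spikes, scaled so labels stay O(1).
    xi = gammas.sum(axis=0)
    xi /= np.linalg.norm(xi) * np.sqrt(1.0 + theta)
    X = sample_mixture(rng, m, thetas, gammas, rhos)
    X_new = sample_mixture(rng, m, thetas, gammas, rhos)
    y, y_new = relu(X @ xi), relu(X_new @ xi)
    F0 = rng.standard_normal((k, n)) / np.sqrt(n)   # assumed initialization scale
    a0 = rng.standard_normal(k) / np.sqrt(k)        # assumed initialization scale
    F1 = F0 + eta * negative_gradient(F0, a0, X, y)  # the single gradient step

    def feats(Z):
        return relu(Z @ F1.T) / np.sqrt(k)

    a_hat = ridge_second_layer(feats(X), y, lam)
    train_err = np.mean((y - feats(X) @ a_hat) ** 2)
    test_err = np.mean((y_new - feats(X_new) @ a_hat) ** 2)
    return train_err, test_err

train_err, test_err = run_demo()
print(f"train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The labels follow the $y = \text{ReLU}(\boldsymbol{\xi}^T \bm{x})$ rule quoted in the Figure 1 caption; the sizes are kept small so the sketch runs quickly, and they are far from the proportional limit the paper analyzes.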

Paper Structure

This paper contains 33 sections, 10 theorems, 52 equations, 9 figures.

Key Result

Lemma 1

Consider the gradient ${\bm{G}}$ defined in (eq:gradient_definition). It admits the following decomposition where ${\bm{u}} := \Tilde{h}_1 {\bm{w}}$ and ${\bm{v}} := \Tilde{{\bm{X}}}^T \Tilde{{\bm{y}}} / (m\sqrt{k})$, where $\Tilde{h}_1 := \mathbb{E}_{z \sim {\mathcal{N}}(0,1)}[\sigma^\prime(z)]$. Also, $\|{\bm{u}}\| = {\Tilde{{\mathcal{O}}}}(1)$, $\|{\bm{v}}\| = {\Tilde{{\mathcal{O}}}}\left(k^{-

Figures (9)

  • Figure 1: Generalization error comparison between the neural network and the Hermite model. We set both the input dimension and the number of samples to $n = m = 1000$, with two Gaussian components ($\mathcal{C} = 2$) and covariance matrix ranks of $d_1 = d_2 = 1$. The mixture ratio for both components is set to $\rho_1 = \rho_2 = 0.5$, and a regularization constant of $\lambda = 10^{-4}$ is applied. For the labels, we utilize $y = \text{ReLU}({\boldsymbol{\xi}}^T {\bm{x}})$, and we limit the maximum degree of the Hermite polynomial to $l = 5$ for numerical stability. The figure presents averages from 20 Monte Carlo simulations.
  • Figure 2: Impacts of properties of the Gaussian mixture data model on generalization performance. Here, we set the number of Gaussian components to $\mathcal{C} = 2$, with equal input dimensions and sample sizes of $m = n = k = 1000$. The parameters are configured with $\beta = 3/4^-$, $\alpha = 1/2$, $l=4$, and a regularization constant of $\lambda = 10^{-4}$. For (a) and (b), the eigenvalues of the covariance matrix (eq:sigma_decomposition) for each Gaussian component are fixed at $\theta_{1,1} = \theta_{2,1} = n^\beta$, while in (c), the eigenvalues $\{\theta_{c,i}\}_{i=1}^{d_c}$ are sampled uniformly from the interval $(0,n^\beta)$. The results displayed are averages from 20 Monte Carlo simulations, with data resampled for each run.
  • Figure 3: Simulation results on Fashion-MNIST binary classification for $\|{\boldsymbol{\Sigma}}\| = n$ and $\eta = 1$. The data is generated from a conditional GAN trained on the Fashion-MNIST dataset and then pre-processed. In the pre-processing, the inputs from each class are demeaned, re-scaled, and perturbed with added noise so that assumptions (A.2)-(A.4) are satisfied. We use $m = 500$, $\lambda = 10^{-4}$ and $l=5$. Details of the simulations and examples of input images after the pre-processing are provided in the appendix on the Fashion-MNIST experiment.
  • Figure 4: Generalization performance of the neural network with respect to $\alpha$ and $\beta$ as heat maps. We set $n = 400$, $m = 500$, and used the ReLU activation function ($\sigma = \sigma_* = \text{ReLU}$). The number of classes is $\mathcal{C} = 2$ with dimensions $d_1 = d_2 = 1$. The parameters $\theta_{1,1} = \theta_{2,1} = n^{\beta (1 - \alpha)}$ and $\lambda = 10^{-4}$ are employed to control the model's behavior. ${\boldsymbol{\xi}} = ({\boldsymbol{\gamma}}_{1,1} + {\boldsymbol{\gamma}}_{2,1}) / (\|{\boldsymbol{\gamma}}_{1,1} + {\boldsymbol{\gamma}}_{2,1}\| \|{\boldsymbol{\Sigma}}^{1/2}\|)$ is used to ensure high alignment between ${\boldsymbol{\xi}}$ and the data covariance. The results presented are averages from 20 Monte Carlo simulations.
  • Figure 5: Impact of $\beta$ for various $\alpha$ values in the setting of Figure (fig:beta).
  • ...and 4 more figures
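
The captions above cap the Hermite degree at $l = 4$ or $l = 5$. As a rough illustration of what such a truncation means, the sketch below expands ReLU in probabilists' Hermite polynomials $He_j$ under a standard Gaussian input and reports the mean-squared truncation error for each degree. It is only a sketch under assumptions: the paper's Hermite model may use a different normalization or input variance, and the function names and the grid-based integration are choices made here, not taken from the source.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def relu(z):
    return np.maximum(z, 0.0)

def gaussian_grid(half_width=10.0, num_points=200_001):
    """Grid and weights approximating expectations under N(0, 1) by a Riemann sum."""
    z = np.linspace(-half_width, half_width, num_points)
    dz = z[1] - z[0]
    weights = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi) * dz
    return z, weights

def hermite_coefficients(sigma, degree):
    """c_j = E[sigma(z) He_j(z)] / j!  for j = 0..degree, with z ~ N(0, 1)."""
    z, weights = gaussian_grid()
    coeffs = np.empty(degree + 1)
    for j in range(degree + 1):
        basis = np.zeros(j + 1)
        basis[j] = 1.0                      # coefficient vector selecting He_j
        he_j = hermeval(z, basis)
        coeffs[j] = np.sum(sigma(z) * he_j * weights) / math.factorial(j)
    return coeffs

def truncation_error(sigma, coeffs):
    """E[(sigma(z) - sum_j c_j He_j(z))^2] for the truncated expansion."""
    z, weights = gaussian_grid()
    approx = hermeval(z, coeffs)
    return float(np.sum((sigma(z) - approx) ** 2 * weights))

max_degree = 5
coeffs = hermite_coefficients(relu, max_degree)
errors = [truncation_error(relu, coeffs[: l + 1]) for l in range(max_degree + 1)]
for l, err in enumerate(errors):
    print(f"degree {l}: coefficient {coeffs[l]: .4f}, truncation error {err:.4f}")
```

For ReLU the first two coefficients have the closed forms $c_0 = 1/\sqrt{2\pi}$ and $c_1 = 1/2$, and the odd coefficients beyond degree one vanish, which gives a quick sanity check on the numerical integration.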

Theorems & Definitions (20)

  • Lemma 1: Spike+bulk decomposition of the gradient
  • proof
  • Lemma 2: Structure+bulk decomposition of $\hat{{\bm{F}}} {\bm{x}}$
  • proof
  • Theorem 3: Conditional Gaussian equivalence
  • proof
  • Theorem 4
  • proof
  • Lemma 5: Conditional CLT
  • proof
  • ...and 10 more