Learning Two-layer Neural Networks with Symmetric Inputs

Rong Ge, Rohith Kuditipudi, Zhize Li, Xiang Wang

TL;DR

This work tackles the theory of learning two-layer ReLU networks under symmetric input distributions, a setting where prior results struggle with nonconvex optimization. It develops a method-of-moments framework combined with tensor-decomposition techniques (inspired by FOOBI) and a distinguishing-matrix construct to identify and recover the first-layer weights W and second-layer weights A. A key innovation is reducing two-layer learning to multiple single-layer problems via a Pure Neuron Detector and a linearization of higher-order moments, enabling polynomial-time recovery under nondegeneracy and smoothed-analysis guarantees. Empirical results demonstrate strong sample efficiency and robustness to noise and conditioning across diverse symmetric inputs, highlighting the practical potential of spectral methods for structured, non-Gaussian data. The work opens avenues for broader input-distribution classes and deeper connections between optimization landscapes and identifiability in neural architectures.

Abstract

We give a new algorithm for learning a two-layer neural network under a general class of input distributions. Assuming there is a ground-truth two-layer network $y = A\sigma(Wx) + \xi$, where $A, W$ are weight matrices, $\xi$ represents noise, and the number of neurons in the hidden layer is no larger than the input or output, our algorithm is guaranteed to recover the parameters $A, W$ of the ground-truth network. The only requirement on the input $x$ is that it is symmetric, which still allows highly complicated and structured input. Our algorithm is based on the method-of-moments framework and extends several results in tensor decompositions. We use spectral algorithms to avoid the complicated non-convex optimization in learning neural networks. Experiments show that our algorithm can robustly learn the ground-truth neural network with a small number of samples for many symmetric input distributions.
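To make the role of symmetry concrete, here is a minimal NumPy sketch. It assumes $\sigma$ is the ReLU (as stated in the TL;DR) and that the noise is zero-mean and independent of $x$. Because $\sigma(z) - \sigma(-z) = z$, a symmetric input gives $\mathbb{E}[y x^\top] = \tfrac{1}{2} A W\, \mathbb{E}[x x^\top]$, so a second-order cross-moment already exposes the product $AW$. The function names, dimensions, and the trick of pairing each draw with its reflection are illustrative choices, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sample_network(A, W, n, noise_std=0.0, seed=0):
    """Draw about n samples from y = A relu(W x) + xi with a symmetric input.

    Assumption for illustration: x is standard Gaussian, and every draw is
    paired with its reflection -x so the sample itself is exactly symmetric.
    """
    rng = np.random.default_rng(seed)
    half = rng.standard_normal((n // 2, W.shape[1]))
    X = np.vstack([half, -half])                      # rows are inputs x
    noise = noise_std * rng.standard_normal((X.shape[0], A.shape[0]))
    Y = relu(X @ W.T) @ A.T + noise                   # rows are outputs y
    return X, Y

def first_order_gap(A, W, X, Y):
    """Relative gap between empirical E[y x^T] and 0.5 * A W E[x x^T]."""
    n = X.shape[0]
    m_yx = Y.T @ X / n
    m_xx = X.T @ X / n
    target = 0.5 * A @ W @ m_xx
    return np.linalg.norm(m_yx - target) / np.linalg.norm(target)

rng = np.random.default_rng(1)
k, d, m = 3, 4, 5          # hidden, input, output sizes with k <= d and k <= m
W = rng.standard_normal((k, d))
A = rng.standard_normal((m, k))

X, Y = sample_network(A, W, n=100_000, noise_std=0.1)
gap = first_order_gap(A, W, X, Y)                     # small, driven by noise

X0, Y0 = sample_network(A, W, n=1_000, noise_std=0.0)
gap_noiseless = first_order_gap(A, W, X0, Y0)         # zero up to rounding
```

With the reflected sample and no noise, the identity holds up to floating-point rounding; with noise, the gap shrinks as the sample grows. This moment alone cannot separate $A$ from $W$, which is consistent with the abstract's reliance on further moment and tensor-decomposition machinery.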

Paper Structure

This paper contains 54 sections, 56 theorems, 229 equations, 5 figures, 4 algorithms.

Key Result

Theorem 1

Suppose the data is generated according to the network model $y = A\sigma(Wx) + \xi$ (Equation eq:network_intro) and the input distribution $x\sim \mathcal{D}$ is symmetric. Given exact correlations between $x$ and $y$ of order at most 4, as long as $A$, $W$, and the input distribution are not degenerate, there is an algorithm that runs in $\mathrm{poly}(d)$ time ...
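As an illustration of what "correlations between $x$ and $y$ of order at most 4" can look like in practice, the sketch below estimates a few cross-moment tensors from samples with `np.einsum`. This excerpt does not say which specific moments the algorithm consumes, so the three tensors chosen here, the function name `cross_moments`, and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def cross_moments(X, Y):
    """Empirical cross-moments of total order 2, 3 and 4.

    X has shape (n, d) and Y has shape (n, m); each row is one sample.
    Returns estimates of E[y (x) x], E[y (x) x (x) x] and E[y (x) y (x) x (x) x],
    where (x) denotes the outer (tensor) product.
    """
    n = X.shape[0]
    m_yx = np.einsum("ni,nj->ij", Y, X) / n
    m_yxx = np.einsum("ni,nj,nk->ijk", Y, X, X) / n
    m_yyxx = np.einsum("ni,nj,nk,nl->ijkl", Y, Y, X, X) / n
    return m_yx, m_yxx, m_yyxx

# Toy data from y = A relu(W x) with Gaussian x (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, k, m = 2_000, 4, 3, 3
W = rng.standard_normal((k, d))
A = rng.standard_normal((m, k))
X = rng.standard_normal((n, d))
Y = np.maximum(X @ W.T, 0.0) @ A.T

m_yx, m_yxx, m_yyxx = cross_moments(X, Y)
```

The theorem assumes exact moments; estimates like these carry sampling error that decreases as $n$ grows, and the order-4 tensor has $m^2 d^2$ entries, so its cost grows quickly with dimension.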

Figures (5)

  • Figure 1: Network model.
  • Figure 2: Error in recovering $W$, $A$ and outputs ("MSE") for different numbers of training samples and different dimensions of $W$ and $A$. Each point is the result of averaging across five trials, where on the left $W$ and $A$ are both drawn as random $10\times 10$ orthonormal matrices and in the center as $32\times 32$ orthonormal matrices. On the right, given $10,000$ training samples we plot the square root of the algorithm's error normalized by the dimension of $W$ and $A$, which are again drawn as random orthonormal matrices. The input distribution is a spherical Gaussian.
  • Figure 3: Error in recovering $W$, $A$ and outputs ("MSE") for different amounts of label noise. Each point is the result of averaging across five trials with 10,000 training samples, where for each trial $W$ and $A$ are both drawn as $10 \times 10$ orthonormal matrices. The input distribution on the left is a spherical Gaussian and on the right a mixture of two Gaussians with one component based at the all-ones vector and the other component at its reflection.
  • Figure 4: Error in recovering $W$, $A$ and outputs ("MSE"), on the left for different levels of conditioning of $W$ and on the right for $A$. Each point is the result of averaging across five trials with 20,000 training samples, where for each trial one parameter is drawn as a random orthonormal matrix while the other is drawn as described in Section \ref{sec:exp:conditioning}. The input distribution is a mixture of Gaussians with two components, one based at the all-ones vector and the other at its reflection.
  • Figure 5: Characterize T as the product of four matrices.
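The data-generation setup described in the captions of Figures 2 to 4 can be sketched roughly as follows. The captions specify random orthonormal $W$ and $A$, label noise, and a two-component Gaussian mixture centered at the all-ones vector and its reflection; the identity covariance for each component, the equal mixing weights, the QR-based sampler, and all function names are assumptions added for this sketch.

```python
import numpy as np

def random_orthonormal(d, rng):
    """Random d x d orthonormal matrix from the QR factorization of a Gaussian."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))    # fix column signs so Q is not biased by QR

def reflected_mixture(n, d, rng):
    """Mixture of N(+1, I) and N(-1, I) with equal weights (assumed), 1 = all-ones."""
    signs = rng.choice([-1.0, 1.0], size=(n, 1))
    return signs * np.ones(d) + rng.standard_normal((n, d))

def make_dataset(n, d, noise_std, seed=0):
    """Samples from y = A relu(W x) + xi with orthonormal A, W and mixture input."""
    rng = np.random.default_rng(seed)
    W = random_orthonormal(d, rng)
    A = random_orthonormal(d, rng)
    X = reflected_mixture(n, d, rng)
    Y = np.maximum(X @ W.T, 0.0) @ A.T + noise_std * rng.standard_normal((n, d))
    return W, A, X, Y

W, A, X, Y = make_dataset(n=10_000, d=10, noise_std=0.1)
mean_norm = np.abs(X.mean(axis=0)).max()   # near zero for a symmetric input
```

The mixture is symmetric because negating a sample flips both the component center and the Gaussian perturbation, which leaves the distribution unchanged. How the figures measure recovery error for $W$ and $A$ is not stated in the captions, so no error metric is sketched here.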

Theorems & Definitions (78)

  • Theorem 1: informal
  • Theorem 2: informal
  • Theorem 3: informal
  • Definition 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Corollary 1: Pure Neuron Detector
  • Lemma 4
  • Lemma 5
  • ...and 68 more