Learning High-Degree Parities: The Crucial Role of the Initialization

Emmanuel Abbe, Elisabetta Cornacchia, Jan Hązła, Donald Kougang-Yombi

TL;DR

The paper investigates how initialization shapes the learnability of high-degree parity functions under gradient-based training. It proves a sharp separation: two-layer ReLU networks with Rademacher initialization can efficiently learn almost-full parities (including the full parity) using GD/SGD with correlation or hinge loss, while Gaussian initialization with constant variance impedes learning due to small initial gradient alignment. The authors introduce Gradient Alignment (GAL) as a loss-dependent measure of how aligned initialization is with the target and show it governs learnability beyond standard complexity notions. Experiments corroborate the theory, showing a robust special role for Rademacher initialization and a threshold-like behavior with perturbations of increasing magnitude. Overall, the work highlights initialization as a decisive factor in neural-network learning of structurally hard targets and raises questions about threshold phenomena and broader applicability of GAL-based hardness criteria.

Abstract

Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities, while on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural networks learning.
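A minimal sketch of the objects the abstract refers to, assuming NumPy. The names `parity`, `sample_inputs`, and `perturbed_rademacher` are hypothetical, and the perturbation form used here (a Rademacher weight plus $\sigma$ times a standard Gaussian, entrywise) is an assumption for illustration; the paper's Definition 1 may differ, for example in normalization.

```python
import numpy as np


def parity(x, support=None):
    """Parity chi_S(x) = prod_{i in S} x_i for x in {-1, +1}^d.

    With support=None this is the full parity (S = all d coordinates).
    """
    x = np.asarray(x)
    if support is not None:
        x = x[..., support]
    return np.prod(x, axis=-1)


def sample_inputs(n, d, rng):
    """Draw n inputs uniformly from the hypercube {-1, +1}^d."""
    return rng.choice([-1, 1], size=(n, d))


def perturbed_rademacher(n_hidden, d, sigma, rng):
    """First-layer weights: Rademacher signs plus Gaussian noise.

    ASSUMPTION: entrywise W = R + sigma * G, with R uniform on {-1, +1}
    and G standard Gaussian. sigma = 0 gives plain Rademacher weights.
    """
    r = rng.choice([-1.0, 1.0], size=(n_hidden, d))
    g = rng.standard_normal((n_hidden, d))
    return r + sigma * g


rng = np.random.default_rng(0)
d = 50
X = sample_inputs(8, d, rng)

y_full = parity(X)                               # degree-d (full) parity
y_almost = parity(X, support=np.arange(d - 2))   # degree d-2 (almost-full) parity

# Regime of the positive result in the abstract: sigma = O(1/d).
W = perturbed_rademacher(16, d, sigma=1.0 / d, rng=rng)
```

Setting `sigma` to a constant such as `1.0` instead of `1.0 / d` gives the regime the abstract describes as preventing learning.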

Paper Structure

This paper contains 64 sections, 31 theorems, 170 equations, 9 figures.

Key Result

Theorem 1

Let $f(x) = \chi_d(x)$. A two-layer $\mathrm{ReLU}$ network with some $\mathrm{poly}(d)$ hidden units and $\sigma$-perturbed Rademacher initialization with $\sigma = O(d^{-1})$, trained by GD or SGD with any batch size with the correlation… [Footnote: The correlation loss is de…]
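The following sketch is not taken from the paper's proof; it checks one elementary identity that may help intuition for why sign-valued weights are special for the full parity. For $w, x \in \{\pm 1\}^d$, the pre-activation $w \cdot x$ equals $d - 2m$, where $m$ is the number of coordinates in which $w$ and $x$ differ, so $\chi_d(x) = \chi_d(w)\,(-1)^{(d - w \cdot x)/2}$. The Gaussian perturbation form $w + \sigma g$ and the helper name `lattice_recovery_rate` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
X = rng.choice([-1, 1], size=(1000, d))   # uniform inputs on the hypercube
chi_x = np.prod(X, axis=1)                # full parity of each input

# One Rademacher neuron: its pre-activation lies in {-d, -d+2, ..., d}.
w = rng.choice([-1, 1], size=d)
pre = X @ w
mismatches = (d - pre) // 2               # coordinates where w and x differ

# The pre-activation alone determines the parity, up to the fixed sign chi_d(w).
predicted = np.prod(w) * np.where(mismatches % 2 == 0, 1, -1)
identity_holds = bool(np.array_equal(predicted, chi_x))

# ASSUMPTION: the perturbed neuron is w + sigma * g with g standard Gaussian.
g = rng.standard_normal(d)


def lattice_recovery_rate(sigma):
    """Fraction of inputs whose mismatch count is still recovered by
    rounding the perturbed pre-activation to the nearest lattice point."""
    pre_pert = X @ (w + sigma * g)
    m_hat = np.rint((d - pre_pert) / 2)
    return float(np.mean(m_hat == mismatches))


for sigma in (0.0, 1.0 / d, 1.0):
    print(f"sigma={sigma:.3f}  recovery rate={lattice_recovery_rate(sigma):.3f}")
```

The perturbation shifts each pre-activation by $\sigma\, g \cdot x$, which has typical size $\sigma\sqrt{d}$, while adjacent lattice points are 2 apart. This gives a rough feel for why a small $\sigma$ leaves the structure intact and a constant $\sigma$ does not; it is an illustration only and does not reproduce the paper's argument.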

Figures (9)

  • Figure 1: Learning the full parity with $\sigma$-perturbed initialization by SGD with the hinge loss on a $4$-layer MLP, with $d=50$, with online fresh samples (left) and with an offline fixed dataset (right).
  • Figure 2: Computing numerically the alignment $\mathrm{GAL}_f$ with the hinge loss (left) and the correlation loss (right), for a one-neuron network.
  • Figure 3: Learning the full parity with perturbations of the Rademacher initialization by SGD with the hinge loss on a $4$-layer MLP, with $d=50$, with online fresh samples (left) and with an offline dataset (right).
  • Figure 4: Learning 3-parity (left) and 5-parity (right) with Rademacher, $\sigma$-perturbed and Gaussian initializations, with SGD with the hinge loss on a 4-layer MLP, with $d=50$. We plot the test accuracy, for several training set sizes.
  • Figure 5: Learning the full parity with $\sigma$-perturbed initialization by SGD with the hinge loss on a $4$-layer MLP, with input dimension $d=100$ (top-left), $d=150$ (top-right) and $d=200$ (bottom), with online fresh samples.
  • ...and 4 more figures

Theorems & Definitions (66)

  • Definition 1: Perturbed Initialization
  • Theorem 1: Informal, Positive Full Parity
  • Definition 2: Gradient Alignment
  • Theorem 2: Informal, Negative General
  • Theorem 3: Informal, Negative Almost-Full Parities
  • Definition 3: Noisy-(S)GD
  • Theorem 4
  • Theorem 5
  • Corollary 1
  • Corollary 2
  • ...and 56 more