Learning High-Degree Parities: The Crucial Role of the Initialization
Emmanuel Abbe, Elisabetta Cornacchia, Jan Hązła, Donald Kougang-Yombi
TL;DR
The paper investigates how initialization shapes the learnability of high-degree parity functions under gradient-based training. It proves a sharp separation: two-layer ReLU networks with Rademacher initialization can efficiently learn almost-full parities (including the full parity) using GD/SGD with correlation or hinge loss, while Gaussian initialization with constant variance impedes learning due to small initial gradient alignment. The authors introduce Gradient Alignment (GAL) as a loss-dependent measure of how aligned initialization is with the target and show it governs learnability beyond standard complexity notions. Experiments corroborate the theory, showing a robust special role for Rademacher initialization and a threshold-like behavior with perturbations of increasing magnitude. Overall, the work highlights initialization as a decisive factor in neural-network learning of structurally hard targets and raises questions about threshold phenomena and broader applicability of GAL-based hardness criteria.
Abstract
Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On the one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities; on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural network learning.
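To make the objects in the abstract concrete, here is a minimal sketch of a parity target on $\pm 1$ inputs and of the two weight initializations being contrasted. The function names, the unscaled $\pm 1$ weights, and the additive independent Gaussian noise model are illustrative assumptions, not the paper's exact setup.

```python
import math
import random


def parity(x, support):
    """Parity chi_S(x) = product of x[i] over i in S, for x in {-1, +1}^d."""
    return math.prod(x[i] for i in support)


def rademacher_init(rng, n_hidden, d):
    """Hidden-layer weights drawn i.i.d. from {-1, +1} with probability 1/2 each.

    Assumption: no width- or dimension-dependent scaling is applied here.
    """
    return [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n_hidden)]


def perturbed_rademacher_init(rng, n_hidden, d, sigma):
    """Rademacher weights plus independent N(0, sigma^2) noise per coordinate.

    Assumption: this is one plausible reading of "Gaussian perturbation".
    """
    base = rademacher_init(rng, n_hidden, d)
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in base]


if __name__ == "__main__":
    rng = random.Random(0)
    d = 8
    x = [rng.choice((-1, 1)) for _ in range(d)]
    print("input:", x)
    print("full parity (k = d):", parity(x, range(d)))
    print("almost-full parity (k = d - 1):", parity(x, range(d - 1)))
    print("perturbed row:", perturbed_rademacher_init(rng, 1, d, sigma=0.5)[0])
```

With `sigma = 0` the perturbed initialization coincides with the Rademacher one, so the regime of interest in the abstract lies between `sigma` of order $d^{-1}$ and a large enough constant.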
