Hardness of Learning Fixed Parities with Neural Networks

Itamar Shoshani, Ohad Shamir

TL;DR

The paper asks why fixed parity functions remain hard to learn with gradient-based methods, even though parities are computationally tractable to learn by other means. It proves a new exponential decay bound on the Fourier coefficients of linear threshold functions and connects this bound to the gradients encountered by perturbed gradient descent. From this it shows that standard training of one-hidden-layer ReLU networks fails to meaningfully reduce the parity-learning objective for any fixed parity whose support $S$ is sufficiently large, including the full parity over all coordinates. A complementary single-neuron result under the squared loss exhibits the same hardness, which locates the difficulty in the optimization dynamics rather than in the network's expressive power. Together, these results clarify the practical limits of gradient-based learning on parity tasks and leave open whether the analysis extends to SGD, other architectures, and non-spherical weight distributions.

Abstract

Learning parity functions is a canonical problem in learning theory which, although computationally tractable, is not amenable to standard learning algorithms such as gradient-based methods. This hardness is usually explained via statistical query lower bounds [Kearns, 1998]. However, these bounds only imply that for any given algorithm, there is some worst-case parity function that will be hard to learn. Thus, they do not explain why fixed parities (say, the full parity function over all coordinates) are difficult to learn in practice, at least with standard predictors and gradient-based methods [Abbe and Boix-Adsera, 2022]. In this paper, we address this open problem by showing that for any fixed parity of some minimal size, using it as a target function to train one-hidden-layer ReLU networks with perturbed gradient descent will fail to produce anything meaningful. To establish this, we prove a new result about the decay of the Fourier coefficients of linear threshold (or weighted majority) functions, which may be of independent interest.
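
To make the central quantity concrete, the sketch below computes Fourier coefficients $\hat f(S)=\mathbb{E}_x[f(x)\,p_S(x)]$ of a linear threshold function $f(x)=\mathrm{sign}(\langle w,x\rangle)$ by brute force over $\{\pm 1\}^d$. It only illustrates what "Fourier coefficient of a linear threshold function" means and does not reproduce the paper's bound. All function names, the dimension $d=8$, the Gaussian choice of $w$, and the tie-breaking rule are assumptions made for illustration.

```python
import itertools
import numpy as np


def hypercube(d):
    """Return all 2^d points of {-1, +1}^d as the rows of an integer array."""
    return np.array(list(itertools.product([-1, 1], repeat=d)))


def parity(X, S):
    """Evaluate p_S(x) = prod_{i in S} x_i on every row of X (empty S gives 1)."""
    cols = np.array(sorted(S), dtype=int)
    return np.prod(X[:, cols], axis=1)


def linear_threshold(X, w, b=0.0):
    """sign(<w, x> + b), with ties broken towards +1 (an arbitrary choice)."""
    return np.where(X @ w + b >= 0, 1, -1)


def fourier_coefficient(f_values, X, S):
    """hat f(S) = E_x[f(x) p_S(x)] under the uniform distribution on the cube."""
    return float(np.mean(f_values * parity(X, S)))


def max_abs_coefficient_by_size(f_values, X):
    """For each k, the largest |hat f(S)| over all subsets S with |S| = k."""
    d = X.shape[1]
    return {
        k: max(abs(fourier_coefficient(f_values, X, S))
               for S in itertools.combinations(range(d), k))
        for k in range(d + 1)
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
    d = 8                            # small enough to enumerate 2^d points
    X = hypercube(d)
    w = rng.standard_normal(d)       # assumed weight distribution
    f = linear_threshold(X, w)
    for k, value in max_abs_coefficient_by_size(f, X).items():
        print(f"|S| = {k}: max |hat f(S)| = {value:.4f}")
```

Enumerating the cube costs $2^d$ evaluations per coefficient, so this is only practical for small $d$; for larger $d$ one would have to estimate the expectation by sampling.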

Paper Structure (23 sections, 11 theorems, 115 equations)

Key Result

Theorem 1

For any $S\subseteq[d]$, there exists a one-hidden-layer ReLU network $N_{\theta}(x)$ of width $n=|S|+1$ such that $\|\theta\|\leq 6|S|^{3/2}$ and $N_{\theta}(x)=p_S(x)$ for all $x\in\{\pm 1\}^d$, where $p_S(x)=\prod_{i\in S}x_i$ is the parity over $S$. Thus, for this $\theta$, the objective satisfies $F_S(\theta)=-1$.
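
The expressivity claim can be checked numerically with a small sketch. The construction below is one natural way to realize a parity with $|S|+1$ ReLU units and is not necessarily the one used in the paper's proof: every hidden unit reads the sum $s=\sum_{i\in S}x_i$, and the output weights are chosen so that the resulting piecewise-linear function of $s$ alternates between $+1$ and $-1$ on the reachable values $-|S|,-|S|+2,\dots,|S|$. It assumes hidden-layer biases are allowed and no output bias is used; the names `parity_network`, `forward`, and `parameter_norm` are hypothetical.

```python
import itertools
import numpy as np


def parity_network(S, d):
    """Build (W, b, u) so that u . relu(W x + b) equals p_S(x) on {-1, +1}^d."""
    k = len(S)
    indicator = np.zeros(d)
    indicator[np.array(sorted(S), dtype=int)] = 1.0
    # Every hidden unit computes relu(s - knot), where s = sum_{i in S} x_i.
    W = np.tile(indicator, (k + 1, 1))
    # Unit 0 has its kink at -(k + 1), left of every reachable sum;
    # unit j >= 1 has its kink at the j-th smallest reachable sum, -k + 2(j - 1).
    knots = np.concatenate(([-(k + 1)], -k + 2 * np.arange(k)))
    b = -knots.astype(float)
    # Value at s = -k (all coordinates in S equal to -1) is (-1)^k.
    v0 = (-1.0) ** k
    # Afterwards the slope must flip sign at every knot: +-2 * v0.
    u = np.concatenate(([v0], 2 * v0 * (-1.0) ** np.arange(1, k + 1)))
    return W, b, u


def forward(params, X):
    """Evaluate the one-hidden-layer ReLU network on every row of X."""
    W, b, u = params
    return np.maximum(X @ W.T + b, 0.0) @ u


def parameter_norm(params):
    """Euclidean norm of all parameters stacked into a single vector."""
    return float(np.sqrt(sum(np.sum(p ** 2) for p in params)))


if __name__ == "__main__":
    d, S = 6, (0, 2, 5)
    X = np.array(list(itertools.product([-1, 1], repeat=d)))
    params = parity_network(S, d)
    target = np.prod(X[:, list(S)], axis=1)
    print("width:", params[0].shape[0])
    print("exact on all inputs:", np.allclose(forward(params, X), target))
    print("parameter norm:", parameter_norm(params))
```

For the sizes tried here the parameter norm of this sketch also stays below $6|S|^{3/2}$, but that is a numerical observation about this particular construction, under the assumption that $\theta$ collects all weights and biases, rather than a restatement of the theorem's proof.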

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • proof
  • ...and 4 more