Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals

Anxin Guo, Aravindan Vijayaraghavan

TL;DR

The work resolves whether an arbitrarily biased ReLU under Gaussian marginals can be learned in the agnostic setting in polynomial time. It introduces a two-stage SQ algorithm, thresholded PCA to obtain a warm start followed by a reweighted PGD procedure that refines toward $\mathrm{OPT}$, and proves a constant-factor approximation $L(\hat{w},\hat{b}) \le \alpha\,\mathrm{OPT} + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$. It also establishes a CSQ hardness barrier: no polynomial-time CSQ algorithm achieves a constant-factor approximation. Together these give an SQ-CSQ separation for learning a single neuron under Gaussian inputs, which highlights the limitations of gradient-descent-based, correlational approaches in the arbitrary-bias regime. The results shed light on the intrinsic computational complexity of agnostic ReLU regression and point to SQ techniques beyond correlational queries as a viable path for simple yet challenging neural components, with implications for single-index model learning.

Abstract

We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting. Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
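To make the objective in the abstract concrete, the following is a minimal sketch of the empirical squared loss of a biased ReLU on samples whose $x$-marginal is standard Gaussian. It assumes the loss has the form $L(w,b) = \mathbb{E}[(\mathrm{ReLU}(\langle w, x\rangle + b) - y)^2]$ with no extra normalizing constant; the function names (`predict`, `empirical_loss`, `sample_gaussian_data`) and the demo parameters are hypothetical and are not taken from the paper.

```python
import random


def relu(z: float) -> float:
    """ReLU activation: max(z, 0)."""
    return z if z > 0.0 else 0.0


def predict(w, b, x):
    """Biased ReLU neuron: ReLU(<w, x> + b)."""
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)


def empirical_loss(w, b, samples):
    """Average squared error over (x, y) pairs.

    Assumption: the loss is the plain mean of squared residuals,
    with no 1/2 factor or other normalization.
    """
    return sum((predict(w, b, x) - y) ** 2 for x, y in samples) / len(samples)


def sample_gaussian_data(d, n, label_fn, rng):
    """Draw n points with x ~ N(0, I_d) and labels y = label_fn(x)."""
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append((x, label_fn(x)))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical target neuron with a negative bias.
    w_star, b_star = [1.0, 0.0, 0.0], -1.0
    data = sample_gaussian_data(3, 2000, lambda x: predict(w_star, b_star, x), rng)
    print("loss at target:", empirical_loss(w_star, b_star, data))
    print("loss at zero neuron:", empirical_loss([0.0, 0.0, 0.0], 0.0, data))
```

The demo uses realizable labels only to keep the sketch short. In the agnostic setting of the abstract, `label_fn` could be arbitrary, and $\mathrm{OPT}$ would be the smallest value of this loss over all $(w, b)$.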

Paper Structure

This paper contains 29 sections, 38 theorems, 119 equations, 2 figures.

Key Result

Theorem 1.1

There exists a constant $\alpha$, such that for all $W > 0$ the following holds. Let $\mathcal{D}$ be the joint distribution of $(x,y)\in \mathbb{R}^d\times \mathbb{R}$, where the $x$-marginal is $\mathcal{N}(0, I_d)$. Algorithm alg:full-alg uses $\mathrm{poly}(d, \frac{1}{\varepsilon}, \frac{1}{\de

Figures (2)

  • Figure 1: One-dimensional bad example in \ref{eq:intro:1D-GD-bad-example}. The noise (blue) decreases exponentially in $|b|$.
  • Figure 2: Plan view of the $w_t$-$v^\perp$ plane in high dimensional analysis of GD. The ReLU is positive on the red region. $1-o(1)$ probability mass of the colored region falls in the green strip.

Theorems & Definitions (59)

  • Theorem 1.1: SQ algorithm that gets $O(\mathrm{OPT})$, informal
  • Theorem 1.2: CSQ lower bound of $\omega(\mathrm{OPT})$
  • Proposition 2.1
  • Proposition 2.2
  • Lemma 2.3
  • Theorem 3.2
  • Lemma 3.3
  • Lemma 3.4
  • proof
  • Lemma 3.5
  • ...and 49 more
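
Theorems 1.1 and 1.2 contrast SQ and CSQ algorithms. As a rough illustration of the difference between the two query types, the sketch below simulates both oracles with empirical averages. It uses the standard definitions: an SQ query estimates $\mathbb{E}[q(x,y)]$ for an arbitrary query function $q$, while a CSQ query is restricted to the form $\mathbb{E}[y \cdot h(x)]$. The helper names, the sample-based simulation (which ignores the tolerance parameter and boundedness conditions of the formal model), and the thresholded query at the end are assumptions made for illustration; they are not the paper's algorithm.

```python
import random


def sq_query(q, samples):
    """Empirical stand-in for an SQ oracle: mean of q(x, y)."""
    return sum(q(x, y) for x, y in samples) / len(samples)


def csq_query(h, samples):
    """Empirical stand-in for a CSQ oracle: mean of y * h(x).

    The label y may only enter linearly, as a correlation with h(x).
    """
    return sum(y * h(x) for x, y in samples) / len(samples)


def make_samples(n, rng):
    """Hypothetical data: x ~ N(0, I_2), y = ReLU(x[0] - 1)."""
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        samples.append((x, max(x[0] - 1.0, 0.0)))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_samples(2000, rng)

    # Every CSQ query is also an SQ query with q(x, y) = y * h(x).
    h = lambda x: x[0]
    print("CSQ:", csq_query(h, data))
    print("same query via SQ:", sq_query(lambda x, y: y * h(x), data))

    # An SQ query may depend on y nonlinearly, e.g. through a threshold.
    # This form is not expressible as y * h(x).
    thresholded = lambda x, y: x[0] if y > 0.5 else 0.0
    print("thresholded SQ:", sq_query(thresholded, data))
```

The last query conditions on the label exceeding a threshold, which is the kind of label-dependent filtering a correlational query cannot express directly.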