Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals
Anxin Guo, Aravindan Vijayaraghavan
TL;DR
This work resolves whether an arbitrarily biased ReLU can be learned agnostically under Gaussian marginals in polynomial time. It introduces a two-stage SQ algorithm: thresholded PCA produces a warm start, and a reweighted PGD procedure then refines it toward $\mathrm{OPT}$. The algorithm achieves a constant-factor approximation $L(\hat{w},\hat{b}) \le \alpha\,\mathrm{OPT} + \varepsilon$, for a constant $\alpha$, in time $\mathrm{poly}(d,1/\varepsilon)$. A matching hardness result shows that no polynomial-time CSQ algorithm can achieve a constant-factor approximation. Together these give an SQ-CSQ separation for learning a single neuron under Gaussian inputs and highlight the limits of gradient-descent-based, correlational approaches in the arbitrary-bias regime. The results clarify the computational complexity of agnostic ReLU regression and point to SQ techniques beyond correlational queries as a viable path for simple yet challenging neural components, with implications for learning single-index models.
Abstract
We consider the problem of learning an arbitrarily biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial-time algorithms only provide approximation guarantees for the better-behaved unbiased setting or the restricted-bias setting. Our main result is a polynomial-time statistical query (SQ) algorithm that gives the first constant-factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial-time CSQ algorithm can achieve a constant-factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.
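To make the objective concrete, the following minimal sketch evaluates the empirical squared loss of a biased ReLU $x \mapsto \max(0, w \cdot x + b)$ on standard Gaussian inputs. It illustrates only the problem setup, not the paper's algorithm. The helper names, the planted parameters `w_star` and `b_star`, and the label-corruption model are all assumptions chosen for illustration; in the agnostic setting the labels may be arbitrary, and the paper's exact loss normalization may differ by a constant factor.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def predict(w, b, x):
    # Biased ReLU neuron: max(0, <w, x> + b).
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

def empirical_loss(w, b, samples):
    # Average squared error over (x, y) pairs.
    return sum((predict(w, b, x) - y) ** 2 for x, y in samples) / len(samples)

def make_samples(n, d, w_star, b_star, noise_rate, rng):
    # Hypothetical data model: x ~ N(0, I_d); labels come from a planted
    # neuron, and a noise_rate fraction is replaced by unrelated values.
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = predict(w_star, b_star, x)
        if rng.random() < noise_rate:
            y = rng.gauss(0.0, 1.0)  # illustrative corruption only
        samples.append((x, y))
    return samples

rng = random.Random(0)
w_star, b_star = [1.0, 0.0, 0.0], -1.5  # negative bias: neuron is rarely active

clean = make_samples(2000, 3, w_star, b_star, 0.0, rng)
noisy = make_samples(2000, 3, w_star, b_star, 0.1, rng)

loss_clean = empirical_loss(w_star, b_star, clean)  # realizable case
loss_noisy = empirical_loss(w_star, b_star, noisy)  # upper bound on empirical OPT
loss_zero = empirical_loss([0.0, 0.0, 0.0], 0.0, noisy)  # trivial all-zero predictor

print(loss_clean, loss_noisy, loss_zero)
```

With the bias set to $-1.5$, the planted neuron is active on only about 7% of Gaussian inputs. In this toy instance the loss of the planted neuron on the corrupted sample upper-bounds the empirical analogue of $\mathrm{OPT}$, and a constant-factor guarantee asks for a hypothesis whose loss is within a constant multiple of $\mathrm{OPT}$, plus $\varepsilon$.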
