Robustly Learning Single-Index Models via Alignment Sharpness

Nikos Zarifis, Puqian Wang, Ilias Diakonikolas, Jelena Diakonikolas

TL;DR

The paper tackles the problem of learning single-index models (SIMs) under the squared loss in the agnostic setting with unknown link functions. It introduces alignment sharpness, a local-error-bound notion for a convex surrogate loss, and develops a computationally efficient algorithm that achieves a universal constant-factor approximation to the best possible $L_2^2$ loss. The key ideas are to select best-fit activations along a projected direction and to leverage a gradient-alignment guarantee that contracts the misalignment between the estimated and true directions, enabling linear-rate convergence. The results hold under mild distributional assumptions (the well-behaved class) and for the broad family of $(a,b)$-unbounded activations, which includes ReLU-like functions, providing the first polynomial-time constant-factor agnostic learner for Gaussian marginals and unknown link functions. The work thus advances practical agnostic learning for SIMs and suggests broader applicability of alignment-based analysis in optimization.
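The two ideas named above (fit the best activation along the current direction, then take a gradient-style step that improves alignment) can be illustrated with a small sketch. This is not the paper's algorithm: it substitutes plain isotonic regression (pool-adjacent-violators) for the constrained best fit over $(a,b)$-unbounded Lipschitz activations, and it uses an Isotron-style update on the unit sphere. The function names `pava`, `fit_activation`, and `learn_direction`, the step size, and the iteration count are all hypothetical choices made for illustration.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares nondecreasing fit to the sequence y."""
    blocks = []  # each block is [mean, count]
    for v in y:
        blocks.append([float(v), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return np.concatenate([np.full(c, m) for m, c in blocks])

def fit_activation(w, X, y):
    """Best monotone fit of y along direction w, evaluated at the sample points.

    Assumption: unconstrained isotonic regression stands in for the paper's
    constrained activation class.
    """
    order = np.argsort(X @ w)
    fitted = np.empty_like(y, dtype=float)
    fitted[order] = pava(y[order])
    return fitted

def learn_direction(X, y, steps=100, lr=0.5, seed=0):
    """Alternate activation fitting with a normalized gradient-style step."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(steps):
        u_vals = fit_activation(w, X, y)
        # Surrogate-style gradient: average of (u(w.x) - y) * x over the sample.
        grad = X.T @ (u_vals - y) / len(y)
        w = w - lr * grad
        w /= np.linalg.norm(w)  # keep the iterate on the unit sphere
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((2000, 5))
    w_star = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
    y = np.maximum(0.0, X @ w_star)  # ReLU labels, noiseless for simplicity
    w_hat = learn_direction(X, y)
    print("alignment with true direction:", float(w_hat @ w_star))
```

The demo uses noiseless ReLU labels with Gaussian inputs, which is far easier than the agnostic setting the paper addresses; it only shows how refitting the activation and updating the direction interact.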

Abstract

We study the problem of learning Single-Index Models under the $L_2^2$ loss in the agnostic model. We give an efficient learning algorithm, achieving a constant factor approximation to the optimal loss, that succeeds under a range of distributions (including log-concave distributions) and a broad class of monotone and Lipschitz link functions. This is the first efficient constant factor approximate agnostic learner, even for Gaussian data and for any nontrivial class of link functions. Prior work for the case of unknown link function either works in the realizable setting or does not attain constant factor approximation. The main technical ingredient enabling our algorithm and analysis is a novel notion of a local error bound in optimization that we term alignment sharpness and that may be of broader interest.
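For concreteness, the "constant factor approximation to the optimal loss" can be written out as below. This is a sketch in standard agnostic-learning notation rather than a quotation from the paper: the symbols $\mathrm{OPT}$, $C$, $\hat{\mathbf{w}}$, and $\hat{u}$ are introduced here for illustration, reading $W$ as a bound on the norm of candidate directions is an assumption, and the exact form of the additive term should be checked against Theorem 1.4.

```latex
\mathcal{L}_2(\mathbf{w}; u)
  = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}
    \bigl[(u(\mathbf{w}\cdot\mathbf{x}) - y)^2\bigr],
\qquad
\mathrm{OPT}
  = \min_{u\in\mathcal{F},\ \|\mathbf{w}\|_2\le W} \mathcal{L}_2(\mathbf{w}; u),
\qquad
\mathcal{L}_2(\hat{\mathbf{w}}; \hat{u}) \le C\cdot\mathrm{OPT} + \epsilon .
```

Here $C$ is a universal constant, so the learner's excess over the best achievable loss scales with $\mathrm{OPT}$ itself rather than with the dimension $d$.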

Paper Structure (38 sections, 18 theorems, 251 equations, 3 figures, 4 algorithms)

This paper contains 38 sections, 18 theorems, 251 equations, 3 figures, 4 algorithms.

Key Result

Theorem 1.4

Given the agnostic learning setting (def:agnostic-learning), where $\mathcal{G}$ is the class of $(L, R)$-well-behaved distributions with $L, R = O(1)$ and $\mathcal{F} = \mathcal{U}_{(a,b)}$ such that $(1/a), b = O(1)$, there is an algorithm that draws $N = \mathrm{poly}(W) \tilde{O}(d/\epsilon^{2})$ samples from $\mathcal{D}$, r

Figures (3)

  • Figure 1: Under the assumption that $\tilde{\mathbf{v}}\cdot\mathbf{x}\in(R/16,R/8)$, and $I_1(\mathbf{x})\geq 0, I_2(\mathbf{x})\geq 0$, the distance between $f(\mathbf{w}\cdot\mathbf{x})$ and $u^*(\mathbf{w}^*\cdot\mathbf{x})$ is at least $|u^*(\alpha \mathbf{w}\cdot \mathbf{x}+\|\mathbf{v}\|_2R/4)-u^*(\mathbf{w}^{\ast}\cdot\mathbf{x})|\geq a\|\mathbf{v}\|_2R/8$.
  • Figure 2: On the 2-dimensional space $V$ spanned by $(\mathbf{x}_{\mathbf{v}},\mathbf{x}_{\mathbf{w}})$, at each point $\mathbf{x}\in B\cup B'$, it must be that $I_1(\mathbf{x})I_2(\mathbf{x})\geq 0$ or $I_2(\mathbf{x})I_3(\mathbf{x})\geq 0$. $\Gamma_1$ denotes the interval of $\mathbf{x}_\mathbf{w} = \mathbf{w}\cdot\mathbf{x}$ such that $f(\mathbf{w}\cdot\mathbf{x})\geq u^*(\alpha\mathbf{w}\cdot\mathbf{x} + \|\mathbf{v}\|_2R)$, hence both $I_1(\mathbf{x})I_2(\mathbf{x})\geq 0$ and $I_2(\mathbf{x})I_3(\mathbf{x})\geq 0$; $\Gamma_2$ denotes the interval of $\mathbf{x}_\mathbf{w}$ such that $f(\mathbf{w}\cdot\mathbf{x})\in (u^*(\alpha\mathbf{w}\cdot\mathbf{x} + \|\mathbf{v}\|_2R/32), u^*(\alpha\mathbf{w}\cdot\mathbf{x} + \|\mathbf{v}\|_2R/4))$, hence $I_2(\mathbf{x})I_3(\mathbf{x})\geq 0$; finally, $\Gamma_3$ denotes the interval of $\mathbf{x}_\mathbf{w}$ such that $f(\mathbf{w}\cdot\mathbf{x})\in (u^*(\alpha\mathbf{w}\cdot\mathbf{x} + \|\mathbf{v}\|_2R/4), u^*(\alpha\mathbf{w}\cdot\mathbf{x} + \|\mathbf{v}\|_2R/))$, hence $I_1(\mathbf{x})I_2(\mathbf{x})\geq 0$. The area of the union of the red and blue regions is the lower bound on the probability in \ref{ineq:lower-bound-prob-mass}. As displayed in the figure, the total area of the blue and red regions is lower bounded by $\mathds{1}\{\mathbf{x}\in B\} + (\mathds{1}\{\mathbf{x}\in B'\} - \mathds{1}\{\mathbf{x}\in B\}) \mathds{1}\{I_2(\mathbf{x})I_3(\mathbf{x})\geq 0\}$.
  • Figure 3: An illustration of $\hat{f}$ for $u^*(z) = \max\{0,z\}$ and a dataset $S^* =\{(\mathbf{x}^{(1)},u^*(\mathbf{w}^*\cdot\mathbf{x}^{(1)})),\dots,(\mathbf{x}^{(6)},u^*(\mathbf{w}^*\cdot\mathbf{x}^{(6)}))\}$ where $\mathbf{w}^*\cdot\mathbf{x}^{(1)}<\mathbf{w}^*\cdot\mathbf{x}^{(2)}<\mathbf{w}^*\cdot\mathbf{x}^{(3)}<0$.

Theorems & Definitions (56)

  • Definition 1.2: Well-Behaved Distributions
  • Definition 1.3: Unbounded Activations
  • Theorem 1.4: Main Algorithmic Result, Informal
  • Example 1.5
  • Proposition 3.0: Alignment Sharpness of the Convex Surrogate
  • Lemma 3.0: Lower Bound on $L_2^2$ Error by Misalignment
  • proof
  • Lemma 3.0: Closeness of Population-Optimal Activations
  • Corollary 3.0: Closeness of Idealized and Attainable Activations
  • proof: Proof of \ref{main:thm:sharpness}
  • ...and 46 more