Iterative thresholding for non-linear learning in the strong $\varepsilon$-contamination model

Arvind Rathnashyam; Alex Gittens

TL;DR

The paper develops gradient-descent based iterative thresholding algorithms for robustly learning single-neuron models under the strong $ε$-contamination model, with corruption in both labels and covariates. It provides explicit approximation bounds for nonlinear activations (sigmoid, leaky-ReLU, ReLU) and linear regression, showing near-optimal dependence on contamination level $ε$ and noise variance $ν$, while achieving favorable sample complexities and runtime improvements. For nonlinear neurons, the main result is an $O(ν\sqrt{ε\log(1/ε)})$-type error with sample complexity $O(d/ε)$ and failure probability $e^{-Ω(d)}$; for linear regression the bound tightens to $O(νε\log(1/ε))$, with significant runtime reductions over prior work. The methods directly handle corrupted covariates using only spectral properties of the (uncorrupted) covariance, yielding practical robustness and broad applicability across activation functions. The work advances theory on robust learning with iterative thresholding beyond GLMs and sets the stage for extensions to broader neural architectures.

Abstract

We derive approximation bounds for learning single neuron models using thresholded gradient descent when both the labels and the covariates are possibly corrupted adversarially. We assume the data follows the model $y = σ(\mathbf{w}^{*} \cdot \mathbf{x}) + ξ,$ where $σ$ is a nonlinear activation function, the noise $ξ$ is Gaussian, and the covariate vector $\mathbf{x}$ is sampled from a sub-Gaussian distribution. We study sigmoidal, leaky-ReLU, and ReLU activation functions and derive an $O(ν\sqrt{ε\log(1/ε)})$ approximation bound in $\ell_{2}$-norm, with sample complexity $O(d/ε)$ and failure probability $e^{-Ω(d)}$. We also study the linear regression problem, where $σ(\mathbf{x}) = \mathbf{x}$. We derive an $O(νε\log(1/ε))$ approximation bound, improving upon the previous $O(ν)$ approximation bounds for the gradient-descent based iterative thresholding algorithms of Bhatia et al. (NeurIPS 2015) and Shen and Sanghavi (ICML 2019). Our algorithm has an $O(\text{polylog}(N,d)\log(R/ε))$ runtime complexity when $\|\mathbf{w}^{*}\|_2 \leq R$, improving upon the $O(\text{polylog}(N,d)/ε^2)$ runtime complexity of Awasthi et al. (NeurIPS 2022).
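The general idea of thresholded gradient descent can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the step size, iteration count, retained fraction $(1-ε)N$, the gross-outlier corruption pattern, and all function names (`hard_threshold`, `thresholded_gd`, `l2_error`) are assumptions chosen for illustration. Each step ranks samples by absolute residual, keeps the smallest ones, and takes a squared-loss gradient step on the kept samples only. The demo uses the linear case $σ(z) = z$.

```python
import random


def hard_threshold(residuals, keep):
    """Indices of the `keep` samples with the smallest absolute residual."""
    order = sorted(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return order[:keep]


def thresholded_gd(X, y, eps, act=lambda z: z, act_grad=lambda z: 1.0,
                   steps=200, lr=0.1):
    """Gradient descent on squared loss over the retained samples only.

    Assumption: (1 - eps) * n samples are retained at every step.
    `act` / `act_grad` are the activation and its derivative (identity here).
    """
    n, d = len(X), len(X[0])
    keep = n - int(round(eps * n))
    w = [0.0] * d
    for _ in range(steps):
        pre = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
        residuals = [act(z) - yi for z, yi in zip(pre, y)]
        kept = hard_threshold(residuals, keep)
        grad = [0.0] * d
        for i in kept:
            scale = residuals[i] * act_grad(pre[i]) / keep
            for j in range(d):
                grad[j] += scale * X[i][j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def l2_error(w, w_star):
    return sum((a - b) ** 2 for a, b in zip(w, w_star)) ** 0.5


# Synthetic demo (all values are illustrative assumptions).
rng = random.Random(0)
n, d, eps, noise_std = 400, 3, 0.1, 0.1
w_star = [1.0, -2.0, 0.5]
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [sum(a * b for a, b in zip(w_star, x)) + rng.gauss(0, noise_std) for x in X]

# Corrupt an eps-fraction of samples in both covariates and labels.
for i in range(int(round(eps * n))):
    X[i] = [5.0] * d
    y[i] = 100.0

w_robust = thresholded_gd(X, y, eps)   # thresholding on
w_plain = thresholded_gd(X, y, 0.0)    # eps = 0 keeps every sample
err_robust = l2_error(w_robust, w_star)
err_plain = l2_error(w_plain, w_star)
print(f"thresholded error: {err_robust:.4f}, plain GD error: {err_plain:.4f}")
```

The corruption here is deliberately easy to detect (large outliers), so the sketch only shows the mechanics; an adversary in the strong $ε$-contamination model may choose far subtler corruptions, which is what the paper's bounds address. Passing a different `act` and `act_grad` (for example a sigmoid and its derivative) would adapt the same loop to a nonlinear neuron.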

Paper Structure (29 sections, 29 theorems, 195 equations, 2 tables, 2 algorithms)

Introduction
Contributions
Preliminaries
Mathematical Notation and Background
Related Work
Iterative Thresholding for gradient-based learning
Warm-up: Multivariate Linear Regression
Activation Functions
Learning Sigmoidal Neurons
Algorithmic $\epsilon$
Learning Leaky-ReLU Neurons
Learning ReLU Neurons
Discussion
Proofs for Linear Regression
Proof of \ref{thm:one-layer-linear-nn-error}
...and 14 more sections

Key Result

Theorem 2

Let $X$ be a sub-Gaussian data matrix, and $\mathbf{y} = X^{\top}\mathbf{w}^* + \mathbf{b}$, where $\mathbf{b}$ is the additive and possibly adversarial corruption. Then there exists a gradient-descent algorithm such that $\lVert \mathbf{w}^{(t)} - \mathbf{w}\,\ldots$

Theorems & Definitions (35)

Definition 1: Strong $\epsilon$-Contamination Model
Theorem 2: Theorem 5 in Bhatia et al. (2015)
Theorem 3: Theorem 4.2 in Awasthi et al. (2022)
Definition 4: Sub-Gaussian Distribution
Definition 5
Definition 6: Hard Thresholding Operator
Theorem 7
Corollary 8
Theorem 12
Theorem 13
...and 25 more
