Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination

Ilias Diakonikolas, Chao Gao, Daniel M. Kane, John Lafferty, Ankit Pensia

TL;DR

This work establishes an information-computation tradeoff for noiseless linear regression with Gaussian covariates under oblivious contamination in the responses. The authors embed the estimation problem into a testing task, construct the contamination from a discrete Gaussian distribution, reduce to a conditional Non-Gaussian Component Analysis (NGCA) framework, and apply Gaussian Fourier analysis to prove Statistical Query (SQ) lower bounds. The main result is a formal lower bound showing that any efficient SQ algorithm requires VSTAT complexity at least $\tilde{\Omega}(\sqrt{d}/\alpha^2)$, so the quadratic dependence on $1/\alpha$ cannot be avoided by SQ methods. The lower bound scales as $\sqrt{d}$ while the best known polynomial-time algorithms use on the order of $d/\alpha^2$ samples, so the precise dependence on the dimension $d$ for efficient algorithms remains open, as does a matching analysis in the low-degree polynomial framework.

Abstract

We study the task of noiseless linear regression under Gaussian covariates in the presence of additive oblivious contamination. Specifically, we are given i.i.d. samples from a distribution $(x, y)$ on $\mathbb{R}^d \times \mathbb{R}$ with $x \sim \mathcal{N}(0,\mathbf{I}_d)$ and $y = x^\top \beta + z$, where $z$ is drawn independently of $x$ from an unknown distribution $E$. Moreover, $z$ satisfies $\mathbb{P}_E[z = 0] = \alpha > 0$. The goal is to accurately recover the regressor $\beta$ to small $\ell_2$-error. Ignoring computational considerations, this problem is known to be solvable using $O(d/\alpha)$ samples. On the other hand, the best known polynomial-time algorithms require $\Omega(d/\alpha^2)$ samples. Here we provide formal evidence that the quadratic dependence in $1/\alpha$ is inherent for efficient algorithms. Specifically, we show that any efficient Statistical Query algorithm for this task requires VSTAT complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$.
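
The data model in the abstract can be made concrete with a small sampler. The following is a minimal sketch, not code from the paper: the function name `sample_contaminated_regression`, the `noise_sampler` argument (a stand-in for the unknown distribution $E$ conditioned on $z \neq 0$), and the particular Gaussian noise used in the example are all assumptions made for illustration. It uses only the Python standard library.

```python
import random


def sample_contaminated_regression(n, beta, alpha, noise_sampler, seed=0):
    """Draw n samples (x, y, clean) with x ~ N(0, I_d) and y = <x, beta> + z.

    z is drawn independently of x (oblivious contamination): with
    probability alpha it is exactly 0, otherwise it comes from
    noise_sampler, a hypothetical stand-in for the unknown distribution E.
    The `clean` flag is returned for inspection only; a learner would not
    observe it.
    """
    rng = random.Random(seed)
    d = len(beta)
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        clean = rng.random() < alpha
        z = 0.0 if clean else noise_sampler(rng)
        y = sum(xi * bi for xi, bi in zip(x, beta)) + z
        samples.append((x, y, clean))
    return samples


# Example: d = 3, alpha = 0.1, heavy Gaussian contamination (an assumption).
beta = [1.0, -2.0, 0.5]
data = sample_contaminated_regression(
    1000, beta, alpha=0.1, noise_sampler=lambda rng: rng.gauss(0.0, 10.0)
)
frac_clean = sum(1 for _, _, clean in data if clean) / len(data)
```

On the clean samples the response equals $x^\top \beta$ exactly, which is what makes the problem "noiseless"; the difficulty is that only about an $\alpha$ fraction of responses are clean and the learner is not told which ones.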

Paper Structure

This paper contains 46 sections, 22 theorems, 63 equations.

Key Result

Proposition 1.4

If there exists a computationally efficient algorithm that computes $\widehat{\beta}$ with $\|\widehat{\beta}-\beta^*\|\leq \rho/4$ with high probability, then it can be transformed into a computationally efficient algorithm for \ref{def:lin-regr-oblivious}.

Theorems & Definitions (52)

  • Definition 1.1: Noiseless Linear Regression with Oblivious Contamination in Responses
  • Proposition 1.4: Efficient Reduction of Testing to Estimation; Informal
  • Definition 1.4: VSTAT Oracle
  • Theorem 1.5: SQ Hardness of \ref{def:lin-regr-oblivious}; informal
  • Definition 2.1: Pairwise Correlation
  • Definition 2.1: Statistical dimension from BreBHLS21
  • Definition 2.1: Success of a query on a distribution
  • Proposition 2.2: Generic SQ Lower Bound
  • Definition 2.3: High-Dimensional Hidden Direction Distribution
  • Definition 2.5: Discrete Gaussian Distributions
  • ...and 42 more
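
The VSTAT oracle named in the list above is not spelled out in this summary. As a hedged illustration, the sketch below uses the definition that is standard in the SQ literature: for a query $f$ with values in $[0,1]$ and true expectation $p = \mathbb{E}[f]$, a VSTAT($t$) oracle may return any value $v$ with $|v - p| \leq \max\{1/t, \sqrt{p(1-p)/t}\}$. That the paper's Definition 1.4 matches this form exactly is an assumption, and the helper names `vstat_tolerance` and `is_valid_vstat_answer` are hypothetical.

```python
import math


def vstat_tolerance(p, t):
    """Largest error a VSTAT(t) oracle may make on a query with mean p.

    Uses the standard form max(1/t, sqrt(p * (1 - p) / t)); whether the
    paper's definition is identical is an assumption.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if t <= 0:
        raise ValueError("t must be positive")
    return max(1.0 / t, math.sqrt(p * (1.0 - p) / t))


def is_valid_vstat_answer(v, p, t):
    """Check whether v is an admissible VSTAT(t) response for true mean p."""
    return abs(v - p) <= vstat_tolerance(p, t)
```

The parameter $t$ plays the role of a sample size: the tolerance matches the standard deviation of an empirical mean over $t$ samples, which is why a VSTAT complexity lower bound is read as evidence about sample complexity for efficient algorithms.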