Robust Regression with Adaptive Contamination in Response: Optimal Rates and Computational Barriers

Ilias Diakonikolas, Chao Gao, Daniel M. Kane, Ankit Pensia, Dong Xie

Abstract

We study robust regression under a contamination model in which covariates are clean while the responses may be corrupted in an adaptive manner. Unlike the classical Huber contamination model, where both covariates and responses may be contaminated and consistent estimation is impossible when the contamination proportion is a non-vanishing constant, the clean-covariate setting admits strictly improved statistical guarantees. Specifically, we show that the additional information in the clean covariates can be carefully exploited to construct an estimator that achieves a better estimation rate than that attainable under Huber contamination. In contrast to the Huber model, this improved rate implies consistency even when the contamination proportion is a constant. A matching minimax lower bound is established using Fano's inequality together with the construction of contamination processes that match $m > 2$ distributions simultaneously, extending the previous two-point lower bound argument in Huber's setting. Despite the improvement over the Huber model from an information-theoretic perspective, we provide formal evidence -- in the form of Statistical Query and Low-Degree Polynomial lower bounds -- that the problem exhibits strong information-computation gaps. Our results strongly suggest that the information-theoretic improvements cannot be achieved by polynomial-time algorithms, revealing a fundamental gap between information-theoretic and computational limits in robust regression with clean covariates.
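To make the setting concrete, the following is a minimal simulation sketch of clean Gaussian covariates with a fraction of responses overwritten by a rule that depends on the covariates. It is not the paper's model definition or estimator: the function name `simulate_adaptive_contamination`, the unit-variance Gaussian noise, the random choice of corrupted indices, and the specific corruption rule (responses generated from `-beta`) are all assumptions made for illustration. A genuinely adaptive adversary could also choose which samples to corrupt after inspecting the data.

```python
import numpy as np

def simulate_adaptive_contamination(n, p, eps, beta, rng):
    """Draw clean covariates and overwrite an eps-fraction of the responses.

    Hypothetical illustration only: the corruption rule below is one
    simple choice, not the adversary considered in the paper.
    """
    X = rng.standard_normal((n, p))            # clean covariates, X ~ N(0, I_p)
    y = X @ beta + rng.standard_normal(n)      # assumed unit-variance Gaussian noise
    n_bad = int(np.floor(eps * n))             # number of corrupted responses
    bad = rng.choice(n, size=n_bad, replace=False)
    y_obs = y.copy()
    # Corrupted responses depend on the covariates of the same sample:
    # they follow a linear model with the wrong coefficient vector -beta.
    y_obs[bad] = X[bad] @ (-beta)
    mask = np.zeros(n, dtype=bool)
    mask[bad] = True
    return X, y_obs, mask

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
X, y_obs, mask = simulate_adaptive_contamination(2000, 3, 0.1, beta, rng)

# Ordinary least squares is not robust: it is pulled toward a mixture
# of beta and the adversary's coefficient vector.
ols = np.linalg.lstsq(X, y_obs, rcond=None)[0]
ols_error = np.linalg.norm(ols - beta)
print("corrupted responses:", int(mask.sum()))
print("OLS error:", ols_error)
```

In this sketch the covariates are never touched, so any estimator can rely on the known distribution of `X`; only `y_obs` carries contamination.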

Paper Structure

This paper contains 37 sections, 19 theorems, 108 equations, 1 figure.

Key Result

Theorem 2.1

Consider data generated from model:adaptive with ${p}=1$ and the estimator (eq:t-med-reg-1d) for some $t\in[0,\sqrt{0.9\log n}]$. For any $\alpha\in(0,1)$, there exist $C,c>0$ such that whenever $\frac{1}{\sqrt{n}}+\epsilon \leq c$, the estimator (eq:t-med-reg-1d) satisfies with probability at least $1-\alpha$. Thus, by taking $t=\sqrt{\frac{1}{2}\log(n\epsilon^2+e)}$, we achieve the error rate $
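As a small numerical aid, the threshold choice $t=\sqrt{\frac{1}{2}\log(n\epsilon^2+e)}$ from Theorem 2.1 can be evaluated and compared against the admissible range $[0,\sqrt{0.9\log n}]$ stated there. The helper names `threshold_t` and `t_upper_limit` are hypothetical, and the sample values of `n` and `eps` are arbitrary; the sketch evaluates only these two formulas and does not implement the estimator itself.

```python
import math

def threshold_t(n, eps):
    """Threshold t = sqrt(0.5 * log(n * eps^2 + e)) from the theorem statement."""
    return math.sqrt(0.5 * math.log(n * eps ** 2 + math.e))

def t_upper_limit(n):
    """Upper end of the admissible range for t, sqrt(0.9 * log n)."""
    return math.sqrt(0.9 * math.log(n))

# Arbitrary illustrative values of the sample size and contamination level.
for n, eps in [(1_000, 0.0), (10_000, 0.1), (1_000_000, 0.05)]:
    t = threshold_t(n, eps)
    print(n, eps, round(t, 4), t <= t_upper_limit(n))
```

With no contamination (`eps = 0`) the formula reduces to $t=\sqrt{1/2}$, and $t$ grows only logarithmically in $n\epsilon^2$.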

Figures (1)

  • Figure 1: Comparison of different contamination models. An arrow from Model A to Model B indicates that the former is a weaker contamination model. A green shade indicates that the model permits consistency for any fixed $\epsilon$ that is sufficiently small (say $\epsilon<1/5$), while the red shade indicates that consistency is not possible for any fixed $\epsilon$. A dashed oval indicates that the covariates are clean, i.e., $X \sim \mathcal{N}(0,I_p)$. See \ref{prop:different-models} for a precise statement.

Theorems & Definitions (40)

  • Theorem 2.1
  • Theorem 3.1
  • Remark 3.2
  • Lemma 3.3
  • Lemma 3.4
  • Corollary 3.5
  • proof
  • Theorem 3.6: Information-theoretic Lower Bound
  • Theorem 4.1
  • Definition 4.2: STAT Oracle
  • ...and 30 more