Confidence Intervals for Linear Models with Arbitrary Noise Contamination

Dong Xie, Chao Gao, John Lafferty

TL;DR

This work develops a robust, adaptive confidence interval for a regression coordinate under Huber-type contamination with unknown contamination level $\epsilon$. By framing CI construction as a Z-estimation problem with a smooth estimating function and a decorrelation step, the method achieves uniform coverage over all contamination distributions and attains the optimal length $O\left(\frac{1}{\sqrt{n(1-\epsilon)^2}}\right)$, matching the rate when $\epsilon$ is known. The approach extends from a univariate intuition to a multivariate regression setting, where nuisance parameters are decorrelated to preserve inference validity. Theoretical guarantees rely on contraction and concentration bounds for first-stage estimators and a smooth, robust objective, while numerical experiments demonstrate favorable coverage and interval length across a range of contamination regimes. Overall, the methodology offers practical, adaptive robust CI construction applicable to linear models with arbitrary noise contamination and heavy-tailed noise.

Abstract

We study confidence interval construction for linear regression under Huber's contamination model, where an unknown fraction of noise variables is arbitrarily corrupted. While robust point estimation in this setting is well understood, statistical inference remains challenging, especially because the contamination proportion is not identifiable from the data. We develop a new algorithm that constructs confidence intervals for individual regression coefficients without any prior knowledge of the contamination level. Our method is based on a Z-estimation framework using a smooth estimating function. The method directly quantifies the uncertainty of the estimating equation after a preprocessing step that decorrelates covariates associated with the nuisance parameters. We show that the resulting confidence interval has valid coverage uniformly over all contamination distributions and attains an optimal length of order $O(1/\sqrt{n(1-\epsilon)^2})$, matching the rate achievable when the contamination proportion $\epsilon$ is known. This result stands in sharp contrast to the adaptation cost of robust interval estimation observed in the simpler Gaussian location model.
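To make the setting concrete, the following is a minimal sketch of the data model and of one possible decorrelation step. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the coefficient names `beta` and `gamma`, the Gaussian clean noise, the constant outlier value, and the use of ordinary least-squares residualization are all hypothetical choices made here. It only shows what it means for a fraction $\epsilon$ of the noise variables to be corrupted, and how the covariate of interest $x_i$ can be made empirically orthogonal to the nuisance covariates $w_i$.

```python
import numpy as np


def simulate_contaminated_regression(n, p, eps, beta=1.0, seed=0):
    """Draw (x, W, y) from a linear model whose noise is Huber-contaminated.

    Assumption: clean noise is N(0, 1) and corrupted entries are set to a
    fixed large value; the contamination model allows any corruption.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                          # covariate of interest
    W = rng.normal(size=(n, p)) + 0.5 * x[:, None]  # nuisance covariates, correlated with x
    gamma = np.ones(p)                              # hypothetical nuisance coefficients
    z = rng.normal(size=n)                          # clean noise
    corrupted = rng.random(n) < eps                 # each noise draw is corrupted w.p. eps
    z[corrupted] = 50.0                             # arbitrary replacement values
    y = beta * x + W @ gamma + z
    return x, W, y, corrupted


def decorrelate(x, W):
    """Residualize x on W by least squares (an illustrative choice)."""
    coef, *_ = np.linalg.lstsq(W, x, rcond=None)
    return x - W @ coef


x, W, y, corrupted = simulate_contaminated_regression(n=2000, p=5, eps=0.2)
x_tilde = decorrelate(x, W)

# After residualization, x_tilde is orthogonal to every nuisance column.
max_inner_product = float(np.abs(W.T @ x_tilde).max())
```

The residualized covariate `x_tilde` carries only the variation in $x_i$ that the nuisance covariates cannot explain, which is the sense in which a decorrelation step isolates the target coefficient from the nuisance parameters. Note that the corrupted indices are returned here only for inspection; in the setting of the paper neither they nor $\epsilon$ are known to the analyst.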

Paper Structure

This paper contains 24 sections, 14 theorems, 98 equations, 2 figures, 1 algorithm.

Key Result

Theorem 1

Consider i.i.d. samples $\{(x_i,w_i,y_i)\}_{i=1}^n$ generated from the linear model (eq:multiv-lin), with $\{(x_i,w_i)\}_{i=1}^n$ and $\{(z_i)\}_{i=1}^n$ independent of each other and satisfying as:bounded moments and (eq:noise-sig). For any $\alpha\in(0,1)$, there exist constants $c>0$ and $C>0$ as long as $\frac{p^2}{n(1-\epsilon)^4}\leq c$.

Figures (2)

  • Figure 1: Coverage against $\epsilon$ across settings.
  • Figure 2: Average interval length against $\epsilon$ across settings.

Theorems & Definitions (26)

  • Theorem 1
  • Remark 1
  • Proposition 1
  • Proposition 2
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • Theorem 3
  • Lemma 3
  • proof
  • ...and 16 more