Confidence Intervals for Linear Models with Arbitrary Noise Contamination
Dong Xie, Chao Gao, John Lafferty
TL;DR
This work develops a robust, adaptive confidence interval for a regression coordinate under Huber-type contamination with unknown contamination level $\epsilon$. By framing CI construction as a Z-estimation problem with a smooth estimating function and a decorrelation step, the method achieves uniform coverage over all contamination distributions and attains the optimal length $O\left(\frac{1}{\sqrt{n(1-\epsilon)^2}}\right)$, matching the rate when $\epsilon$ is known. The approach extends from a univariate intuition to a multivariate regression setting, where nuisance parameters are decorrelated to preserve inference validity. Theoretical guarantees rely on contraction and concentration bounds for first-stage estimators and a smooth, robust objective, while numerical experiments demonstrate favorable coverage and interval length across a range of contamination regimes. Overall, the methodology offers practical, adaptive robust CI construction applicable to linear models with arbitrary noise contamination and heavy-tailed noise.
Abstract
We study confidence interval construction for linear regression under Huber's contamination model, where an unknown fraction of noise variables is arbitrarily corrupted. While robust point estimation in this setting is well understood, statistical inference remains challenging, especially because the contamination proportion is not identifiable from the data. We develop a new algorithm that constructs confidence intervals for individual regression coefficients without any prior knowledge of the contamination level. Our method is based on a Z-estimation framework using a smooth estimating function. The method directly quantifies the uncertainty of the estimating equation after a preprocessing step that decorrelates covariates associated with the nuisance parameters. We show that the resulting confidence interval has valid coverage uniformly over all contamination distributions and attains an optimal length of order $O(1/\sqrt{n(1-\epsilon)^2})$, matching the rate achievable when the contamination proportion $\epsilon$ is known. This result stands in sharp contrast to the adaptation cost of robust interval estimation observed in the simpler Gaussian location model.
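
To make the ingredients named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It simulates a linear model whose noise is drawn from $(1-\epsilon)N(0,1) + \epsilon Q$, decorrelates the covariate of interest from the nuisance covariates, and solves a smooth estimating equation for one coefficient. Everything specific here is an assumption made for illustration: the estimating function `tanh`, a known unit noise scale, the contamination distribution $Q$ (uniform on $[20, 40]$), ordinary least squares as a crude stand-in for a robust first-stage estimator, and a textbook sandwich (Wald-type) interval in place of the paper's construction. The function names `simulate`, `partial_out`, `score`, and `robust_ci` are hypothetical.

```python
import numpy as np


def simulate(n, beta, eps, rng):
    """Draw (X, y) from a linear model with Huber-contaminated noise."""
    X = rng.standard_normal((n, len(beta)))
    clean = rng.standard_normal(n)               # N(0, 1) noise (assumed scale)
    outliers = rng.uniform(20.0, 40.0, size=n)   # arbitrary Q, illustration only
    corrupted = rng.random(n) < eps              # each noise term corrupted w.p. eps
    y = X @ beta + np.where(corrupted, outliers, clean)
    return X, y


def partial_out(others, v):
    """Return the least-squares residual of v on the columns of `others`."""
    coef, *_ = np.linalg.lstsq(others, v, rcond=None)
    return v - others @ coef


def score(theta, w, r):
    """Smooth estimating function; decreasing in theta because tanh is increasing."""
    return np.sum(w * np.tanh(r - w * theta))


def robust_ci(X, y, j, z=1.96, bracket=50.0, iters=80):
    """Z-estimate of coefficient j with a sandwich-variance interval (a stand-in)."""
    others = np.delete(X, j, axis=1)
    w = partial_out(others, X[:, j])   # decorrelated covariate of interest
    r = partial_out(others, y)         # OLS first stage: a non-robust placeholder
    lo, hi = -bracket, bracket
    for _ in range(iters):             # bisection on the monotone score
        mid = 0.5 * (lo + hi)
        if score(mid, w, r) > 0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    psi = np.tanh(r - w * theta)
    meat = np.sum(w**2 * psi**2)           # variability of the estimating equation
    bread = np.sum(w**2 * (1.0 - psi**2))  # minus its derivative in theta
    half = z * np.sqrt(meat) / bread
    return theta, theta - half, theta + half


rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0, 0.5])
X, y = simulate(2000, beta, eps=0.1, rng=rng)
theta_hat, ci_lo, ci_hi = robust_ci(X, y, j=0)
print(theta_hat, ci_lo, ci_hi)
```

In this sketch the heavily corrupted observations have `tanh` values near $\pm 1$ and derivative near zero, so they contribute to the numerator of the standard error but almost nothing to the denominator; this is one informal way to see why an interval length scaling like $1/\sqrt{n(1-\epsilon)^2}$ is plausible. The sketch makes no claim to the uniform coverage guarantee stated in the abstract.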
