Finite Sample Confidence Regions for Linear Regression Parameters Using Arbitrary Predictors

Charles Guille-Escuret; Eugene Ndiaye

TL;DR

This work introduces Residual Intervals Inversion (RII), a method for constructing finite-sample confidence regions for linear regression parameters from the predictions of any predictor. It relies on a minimal noise assumption: the conditional median of the noise is zero, up to an adjustable tolerance $b$. The confidence region $\Theta_{\alpha}$ is defined via residual intervals and can be represented exactly as the feasible set of a Mixed Integer Linear Program (MILP), which enables optimization of linear (and certain quadratic) objectives over the region; an empty region provides a mechanism for hypothesis testing. The authors prove finite-sample validity, derive a boundedness condition, and demonstrate applications to coordinate-wise confidence intervals, robust optimization, and hypothesis testing for linearity. Experiments on linear and non-linear toy data with various noise types show coverage consistent with the finite-sample guarantees, highlight robustness to non-Gaussian noise, and illustrate the trade-off between predictor quality, region size, and computational cost. Overall, RII extends confidence-region methodology beyond Gaussian noise and linear estimators, supporting flexible predictors and downstream optimization tasks, although computational cost can become a concern for high-dimensional problems.

Abstract

We explore a novel methodology for constructing confidence regions for parameters of linear models, using predictions from any arbitrary predictor. Our framework requires minimal assumptions on the noise and can be extended to functions deviating from strict linearity up to some adjustable threshold, thereby accommodating a comprehensive and pragmatically relevant set of functions. The derived confidence regions can be cast as constraints within a Mixed Integer Linear Programming framework, enabling optimization of linear objectives. This representation enables robust optimization and the extraction of confidence intervals for specific parameter coordinates. Unlike previous methods, the confidence region can be empty, which can be used for hypothesis testing. Finally, we validate the empirical applicability of our method on synthetic data.
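To make the idea of a region built from residual intervals concrete, here is a minimal Python sketch of a membership test. It rests on assumptions that this summary does not spell out: each residual interval is taken to be $[\min(Y_i, \widehat{Y}_i), \max(Y_i, \widehat{Y}_i)]$, and a candidate $\theta$ is accepted when $\theta^\top X_i$ falls inside at least $k$ of the intervals. The function names and the toy data are hypothetical and only illustrate the counting logic; they are not the authors' implementation.

```python
from typing import Sequence, Tuple


def residual_interval(y: float, y_hat: float) -> Tuple[float, float]:
    # Assumed form of a residual interval: the segment between the
    # observed label and the predictor's output.
    return (min(y, y_hat), max(y, y_hat))


def count_hits(theta: Sequence[float],
               X: Sequence[Sequence[float]],
               y: Sequence[float],
               y_hat: Sequence[float]) -> int:
    # Count the test points whose residual interval contains theta^T x_i.
    hits = 0
    for x_i, y_i, yh_i in zip(X, y, y_hat):
        lo, hi = residual_interval(y_i, yh_i)
        value = sum(t * x for t, x in zip(theta, x_i))
        if lo <= value <= hi:
            hits += 1
    return hits


def in_region(theta, X, y, y_hat, k: int) -> bool:
    # Assumed membership rule: theta is kept if at least k intervals are hit.
    return count_hits(theta, X, y, y_hat) >= k


# Hypothetical toy data: three test points in two dimensions.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.2, 1.8, 3.5]        # observed labels
y_hat = [0.9, 2.1, 2.8]    # outputs of an arbitrary predictor

print(in_region([1.0, 2.0], X, y, y_hat, k=2))
print(in_region([5.0, 5.0], X, y, y_hat, k=2))
```

Under these assumptions the region is a union of polytopes, which is one way to see why an exact description needs integer variables: a binary indicator per test point records whether its interval is hit, and a single linear constraint requires the indicators to sum to at least $k$. If no $\theta$ satisfies the rule, the region is empty, which is the situation the abstract ties to hypothesis testing.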

Paper Structure

This paper contains 27 sections, 3 theorems, 46 equations, 3 figures, 6 tables.

Introduction
Related Work
Problem Setting
Assumptions
Objectives
Construction of Confidence Regions
Building Residual Intervals
Building the Confidence Region
Representation as a MILP feasible set
Boundedness of $\Theta_{\alpha}$
Applications
Confidence Interval on Coordinates
Robust Optimization
Hypothesis Testing
Experiments
...and 12 more sections

Key Result

Lemma 1

Under assum:cond_inde and (assum:A_a or assum:A_b), for any test point $(X, Y)$ with prediction $\widehat{Y}$, it holds

Figures (3)

Figure 1: Guaranteed coverage $1-\alpha=S_{n_{\rm te}}(k,b)$ from the coverage guarantee (eq:coverage_guarantee) for $n_{\rm te}=30$ and $k\in\{4,8,12,16\}$.
Figure 2: Illustration of residual intervals on synthetic datasets with both linear and non-linear dependence between input $X$ and output $Y$.
Figure 3: Illustration of the bounds covering the ground-truth parameter $\theta_\star$ under various configurations of noise, for $\alpha=0.1$. Squares correspond to upper bounds while circles denote corresponding lower bounds.
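
The coverage level $S_{n_{\rm te}}(k,b)$ plotted in Figure 1 is not defined in this summary. As a hedged illustration, the Python sketch below assumes it is a binomial tail: if each of the $n_{\rm te}$ independent test intervals contains the true value with probability at least $1/2 - b$, then at least $k$ of them do so with probability at least $P(\mathrm{Bin}(n_{\rm te}, 1/2 - b) \ge k)$. The function name `coverage_lower_bound` is hypothetical, and the formula should be checked against the paper's own coverage equation before it is relied on.

```python
from math import comb


def coverage_lower_bound(n_te: int, k: int, b: float) -> float:
    # Assumed form of S_{n_te}(k, b): the upper tail of a Binomial(n_te, p)
    # variable with per-point success probability p = 1/2 - b.
    p = 0.5 - b
    if not 0.0 <= p <= 1.0:
        raise ValueError("b must satisfy 0 <= 1/2 - b <= 1")
    return sum(comb(n_te, i) * p ** i * (1.0 - p) ** (n_te - i)
               for i in range(k, n_te + 1))


# Same grid as the Figure 1 caption: n_te = 30 and k in {4, 8, 12, 16}.
for k in (4, 8, 12, 16):
    print(k, round(coverage_lower_bound(30, k, b=0.0), 4))
```

Under this assumption, raising $k$ or $b$ lowers the guaranteed coverage, so a smaller $k$ buys a stronger guarantee at the price of a larger region.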

Theorems & Definitions (4)

Lemma 1
Theorem 1
Proposition 1
Remark 1
