
Robustness Auditing for Linear Regression: To Singularity and Beyond

Ittai Rubinstein, Samuel B. Hopkins

Abstract

It has recently been discovered that the conclusions of many highly influential econometrics studies can be overturned by removing a very small fraction of their samples (often less than $0.5\%$). These conclusions are typically based on the results of one or more Ordinary Least Squares (OLS) regressions, raising the question: given a dataset, can we certify the robustness of an OLS fit on this dataset to the removal of a given number of samples? Brute-force techniques quickly break down even on small datasets. Existing approaches which go beyond brute force either can only find candidate small subsets to remove (but cannot certify their non-existence) [BGM20, KZC21], are computationally intractable beyond low dimensional settings [MR22], or require very strong assumptions on the data distribution and too many samples to give reasonable bounds in practice [BP21, FH23]. We present an efficient algorithm for certifying the robustness of linear regressions to removals of samples. We implement our algorithm and run it on several landmark econometrics datasets with hundreds of dimensions and tens of thousands of samples, giving the first non-trivial certificates of robustness to sample removal for datasets of dimension $4$ or greater. We prove that under distributional assumptions on a dataset, the bounds produced by our algorithm are tight up to a $1 + o(1)$ multiplicative factor.
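
To make the brute-force baseline mentioned above concrete, here is a minimal sketch (not the paper's algorithm). It assumes the quantity of interest is the largest decrease of $\langle e, \beta \rangle$ over all removals of exactly $k$ samples. The helper names `ols` and `brute_force_removal_effect` and the synthetic data are hypothetical, chosen only for illustration.

```python
import itertools
import math
import numpy as np

def ols(X, Y):
    """Return the OLS coefficient vector for design X and labels Y."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

def brute_force_removal_effect(X, Y, e, k):
    """Largest decrease of <e, beta> over all removals of exactly k samples.

    Enumerates all C(n, k) subsets and refits each time, so it is only
    feasible for tiny n and k (assumed definition, for illustration).
    """
    n = X.shape[0]
    base = e @ ols(X, Y)
    best_drop, best_subset = -math.inf, None
    for removed in itertools.combinations(range(n), k):
        keep = np.setdiff1d(np.arange(n), removed)
        drop = base - e @ ols(X[keep], Y[keep])
        if drop > best_drop:
            best_drop, best_subset = drop, removed
    return best_drop, best_subset

# Hypothetical synthetic data: Y depends on the first covariate only.
rng = np.random.default_rng(0)
n, d, k = 20, 2, 2
X = rng.normal(size=(n, d))
Y = X[:, 0] + 0.5 * rng.normal(size=n)
e = np.array([1.0, 0.0])

drop, subset = brute_force_removal_effect(X, Y, e, k)
num_subsets = math.comb(n, k)  # number of refits the search performed
```

Even this toy instance requires `math.comb(20, 2) = 190` refits; with $n = 1000$ and $k = 5$ the count already exceeds $8 \times 10^{12}$, which is why exhaustive search cannot serve as a certificate at realistic scales.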

Paper Structure

This paper contains 93 sections, 12 theorems, 275 equations, 2 figures, 2 tables, 8 algorithms.

Key Result

Theorem 1.1

Given $e$, $X$, and $Y$, the ACRE and OHARE algorithms output lists of upper/lower bounds $U, L \in \mathbb{R}^n$ s.t.

Figures (2)

  • Figure 1: A comparison of two regressions. Figure \ref{fig:sub1} shows a regression on a main variable $X_1$ and an indicator variable $X_2$ that is set to $1$ on only a very small subset of the samples ($\approx 1\%$). The label values $Y$ are drawn i.i.d. from a normal distribution around $X_1$, resulting in an OLS vector $\beta$ whose first coefficient is positive and whose sign is robust to removing any 158 samples. We use $e = (1,0)$. Using the procedure detailed in Claim \ref{clm:one_hot_brittle}, we perturb only the $X_2$ values of the inlier samples to produce an extremely brittle regression (Figure \ref{fig:sub2}). Because most current efficient approaches to estimating the robustness of a linear regression produce outputs that vary smoothly with the input dataset (such as gradient descent [BGM20], semidefinite programming [BP21], or spectral decompositions [FH23]), they cannot be used to differentiate between these cases.
  • Figure 2: A comparison of our ACRE and OHARE algorithms with previous techniques. In Figure \ref{fig:barplot1}, we plot the number of removals required to flip the sign of several linear regressions from landmark econometrics studies [martinez2022much, angelucci2009indirect, finkelstein2012oregon]. Each of these studies contains a number of linear regressions central to its analysis, covering several applications of linear regression, such as estimating correlations controlled for additional covariates, treatment effects, and instrumental variables regression. For each regression, we run AMIP [BGM20] and KZC [KZC21] to obtain baseline upper bounds on $k_\textup{sign}$ and compare the results to lower bounds produced by OHARE. We list the number of samples and the dimension of each regression below the plot. In Figure \ref{fig:barplot2}, we consider a synthetic dataset consisting of $n=4000$ samples in dimension $d=50$, and plot bounds on the removal effects $\Delta_k(e)$. In this plot, the roles are reversed, with AMIP and KZC representing lower bounds on the removal effects, while our ACRE algorithm gives the first practical upper bound. We compare the bounds produced by ACRE to the previous state of the art for efficiently computable upper bounds [FH23]. Moreover, to ground the scale of the plot, consider the different bounds on $k_{2 \sigma}$ (the number of removals required to shift the regression results outside of their $95\%$ confidence intervals). The ACRE algorithm has two possible backends: spectral or RTI (see Section \ref{sec:prob1_algs}). RTI is more efficient and performs better in practice, while the spectral analysis, which uses ideas from Freund and Hopkins' algorithm, has a somewhat slower runtime ($\widetilde{O}\left(n^3\right)$) and offers a logarithmic advantage on some synthetic datasets. The bound produced by ACRE is almost tight on this range of values of $k$, while Freund and Hopkins' algorithm yields a trivial bound.
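
The distinction between upper and lower bounds on $k_\textup{sign}$ in the Figure 2 caption can be illustrated with a simple greedy heuristic. This is neither AMIP nor KZC, and it is not taken from the paper; it is a hypothetical sketch that repeatedly removes the sample whose exact leave-one-out effect pushes $\langle e, \beta \rangle$ furthest toward a sign flip. It assumes every remaining sample has leverage strictly below $1$, so the leave-one-out formula is well defined. Any removal set it finds is only an upper bound on $k_\textup{sign}$; it certifies nothing about smaller sets.

```python
import numpy as np

def greedy_sign_flip(X, Y, e, max_removals=None):
    """Greedily remove samples until sign(<e, beta>) flips.

    Returns the list of removed indices (an upper bound on k_sign),
    or None if no flip is found within the removal budget.
    """
    n, d = X.shape
    budget = n - d if max_removals is None else max_removals
    keep = np.arange(n)
    removed = []
    sign0 = np.sign(e @ np.linalg.lstsq(X, Y, rcond=None)[0])
    while True:
        Xk, Yk = X[keep], Y[keep]
        beta = np.linalg.lstsq(Xk, Yk, rcond=None)[0]
        if np.sign(e @ beta) != sign0:
            return removed
        if len(removed) >= budget:
            return None
        G_inv = np.linalg.inv(Xk.T @ Xk)
        resid = Yk - Xk @ beta
        # Leverage h_i = x_i^T (X^T X)^{-1} x_i, assumed < 1 here.
        lev = np.einsum("ij,jk,ik->i", Xk, G_inv, Xk)
        # Exact change in <e, beta> when sample i alone is removed.
        delta = -(Xk @ G_inv @ e) * resid / (1.0 - lev)
        i = int(np.argmin(sign0 * delta))  # strongest push toward a flip
        removed.append(int(keep[i]))
        keep = np.delete(keep, i)

# Hypothetical synthetic data with a weak positive effect of X_1.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 2))
Y = 0.3 * X[:, 0] + rng.normal(size=n)
e = np.array([1.0, 0.0])
removed = greedy_sign_flip(X, Y, e)
```

If `removed` has length $m$, then $k_\textup{sign} \le m$ for this dataset. Showing that no set of fewer than some $k$ samples flips the sign requires a certificate of the kind OHARE is described as producing.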

Theorems & Definitions (61)

  • Theorem 1.1: Correctness
  • Definition 1: Well-Behaved Distribution
  • Theorem 1.2: Tightness of ACRE, proof in Section \ref{sec:analysis}
  • Theorem 1.3: Tightness of OHARE, informal, see Section \ref{sec:ohare_tightness}
  • Claim 4.1
  • Theorem 7.1: ACRE Bounds are Tight on Well-behaved Data
  • Definition 2: ACRE-friendly
  • Claim 7.2: ACRE bounds are nearly tight on ACRE-friendly regressions
  • Claim 7.3: Well-behaved distributions yield ACRE-friendly regressions with high probability
  • Lemma 7.4: Matrix Bernstein [tropp2015introduction]
  • ...and 51 more