Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding

Ashesh Rambachan, Amanda Coston, Edward Kennedy

TL;DR

The paper addresses predictive algorithms that are trained and evaluated under selective labels, where outcomes are observed only for units selected by past decision makers. It introduces a covariate-dependent confounding bound $\delta(x)$ that partially identifies conditional likelihoods and performance metrics. The resulting framework subsumes observed-outcome bounds, proxy-outcome bounds, and instrumental-variable bounds, and it provides practical, debiased estimators with oracle-like guarantees for both design (estimating $P(Y^*=1 \mid X)$) and evaluation (bounding overall and class-specific performance). The authors prove an oracle inequality for pseudo-outcome regression, derive regret bounds for plug-in rules, and establish asymptotic normality of the debiased estimators under mild regularity conditions, which enables valid inference across bounding strategies and performance measures. An empirical application to credit risk data shows that both the ranking of risk scores and the fairness conclusions change substantially with the assumed level of unobserved confounding, underscoring the framework's value for robust accountability in high-stakes settings.

Abstract

Predictive algorithms inform consequential decisions in settings with selective labels: outcomes are observed only for units selected by past decision makers. This creates an identification problem under unobserved confounding -- when selected and unselected units differ in unobserved ways that affect outcomes. We propose a framework for robust design and evaluation of predictive algorithms that bounds how much outcomes may differ between selected and unselected units with the same observed characteristics. These bounds formalize common empirical strategies including proxy outcomes and instrumental variables. Our estimators work across bounding strategies and performance measures such as conditional likelihoods, mean square error, and true/false positive rates. Using administrative data from a large Australian financial institution, we show that varying confounding assumptions substantially affects credit risk predictions and fairness evaluations across income groups.

Paper Structure (77 sections, 31 theorems, 242 equations, 7 figures, 3 tables)


Key Result

Lemma 2.1

For all $x \in \mathcal{X}$, $\mathcal{H}(\mu^*(x)) = \left[\underline{\mu}^{*}(x), \overline{\mu}^{*}(x)\right]$, where $\overline{\mu}^{*}(x) = \mu_{1}(x) + \pi_0(x)\,\overline{\delta}(x; \eta)$ and $\underline{\mu}^{*}(x) = \mu_1(x) + \pi_0(x)\,\underline{\delta}(x; \eta)$. Furthermore, where $\overline{\mathrm{perf}}(s; \beta) = \mathbb{E}[\beta_{0,i} + \beta_{1,i} \mu_{1}(X_i) + \beta_{1,i} \pi_0(\,\ldots$
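
The first part of Lemma 2.1 is a pointwise formula, so it can be illustrated numerically. The sketch below is only an illustration, not the authors' estimator: it assumes the nuisance values are already known (no estimation or debiasing), and it reads the notation as $\mu_1(x)$ being the outcome probability among selected units, $\pi_0(x)$ the probability of not being selected, and $\underline{\delta}(x; \eta)$, $\overline{\delta}(x; \eta)$ the assumed lower and upper bounds on the confounding term. That reading is an assumption, since this excerpt does not define the symbols. The function name `conditional_bounds` and all input values are hypothetical.

```python
import numpy as np


def conditional_bounds(mu1, pi0, delta_lower, delta_upper):
    """Pointwise bounds on mu*(x) following the formula in Lemma 2.1.

    All inputs are arrays (or scalars) evaluated at the same covariate
    values x. Treating them as known quantities is an assumption of this
    sketch; in practice they would have to be estimated.
    """
    mu1 = np.asarray(mu1, dtype=float)
    pi0 = np.asarray(pi0, dtype=float)
    delta_lower = np.asarray(delta_lower, dtype=float)
    delta_upper = np.asarray(delta_upper, dtype=float)

    if np.any(delta_lower > delta_upper):
        raise ValueError("delta_lower must not exceed delta_upper")

    # Lower bound: mu_1(x) + pi_0(x) * lower confounding bound
    lower = mu1 + pi0 * delta_lower
    # Upper bound: mu_1(x) + pi_0(x) * upper confounding bound
    upper = mu1 + pi0 * delta_upper
    return lower, upper


# Hypothetical values at two covariate points.
mu1 = np.array([0.10, 0.30])   # outcome probability among selected units
pi0 = np.array([0.50, 0.20])   # probability of not being selected
lower, upper = conditional_bounds(mu1, pi0, delta_lower=-0.05, delta_upper=0.10)
print(lower, upper)
```

With these inputs the interval is wider at the first point than at the second, because a larger share of unselected units ($\pi_0$) leaves more room for unobserved confounding. The sketch does not clip the bounds to $[0, 1]$, since the lemma as stated here does not do so either.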

Figures (7)

  • Figure 1: Estimated personal loan credit risk scores as assumptions on unobserved confounding vary.
  • Figure 2: Bounds on mean square error and ROC curve of benchmark risk score as we vary the assumption on unobserved confounding.
  • Figure 3: Bounds on mean square error and ROC curve of benchmark credit risk score across income groups as we vary the assumption on unobserved confounding.
  • Figure A1: Average integrated mean square error of our estimator, the oracle learner, and the plug-in learner for the upper bound on the conditional probability $\overline{\mu}^*(\cdot)$.
  • Figure A2: Distribution of the estimator for the upper bound on the true positive rate across Monte Carlo simulations with observed outcome bounds.
  • ...and 2 more figures

Theorems & Definitions (56)

  • Lemma 2.1
  • Lemma 2.2
  • Remark 1: Connection to Algorithmic Fairness
  • Remark 2: Connection to Sensitivity Analysis Models in Causal Inference
  • Proposition 3.1
  • Proposition 3.2
  • Proposition 4.1
  • Proposition 4.2
  • Remark 3: Connection to existing work
  • Remark 4: Paths toward full double robustness
  • ...and 46 more