Robust Design and Evaluation of Predictive Algorithms under Unobserved Confounding
Ashesh Rambachan, Amanda Coston, Edward Kennedy
TL;DR
The paper addresses predictive algorithms that are trained and evaluated under selective labels, where outcomes are observed only for units selected by past decision makers. It introduces a covariate-dependent confounding bound $\delta(x)$ that partially identifies conditional likelihoods and performance metrics. The framework subsumes observed-outcome bounds, proxy-outcome bounds, and instrumental-variable bounds, and provides practical, debiased estimators with oracle-like guarantees for both design (estimating $P(Y^*=1|X)$) and evaluation (bounding overall and class-specific performance). The authors prove an oracle inequality for pseudo-outcome regression, derive regret bounds for plug-in rules, and establish asymptotic normality of debiased estimators under mild regularity conditions, enabling valid inference across bounding strategies and performance measures. An empirical application to credit risk data shows that both the ranking of risk scores and the fairness conclusions change substantially with the assumed level of unobserved confounding, underscoring the framework’s value for robust accountability in high-stakes settings.
Abstract
Predictive algorithms inform consequential decisions in settings with selective labels: outcomes are observed only for units selected by past decision makers. This creates an identification problem under unobserved confounding, that is, when selected and unselected units differ in unobserved ways that affect outcomes. We propose a framework for robust design and evaluation of predictive algorithms that bounds how much outcomes may differ between selected and unselected units with the same observed characteristics. These bounds formalize common empirical strategies including proxy outcomes and instrumental variables. Our estimators work across bounding strategies and performance measures such as conditional likelihoods, mean squared error, and true/false positive rates. Using administrative data from a large Australian financial institution, we show that varying confounding assumptions substantially affects credit risk predictions and fairness evaluations across income groups.
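To make the bounding idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes notation that does not appear above: a selection indicator $D$, the selected-unit likelihood $\mu_1(x) = P(Y^*=1|D=1,X=x)$, the selection probability $\pi(x) = P(D=1|X=x)$, and a confounding bound of the form $\delta_{lo}(x) \le P(Y^*=1|D=0,X=x) - \mu_1(x) \le \delta_{hi}(x)$. Under that assumption, the law of total probability gives $P(Y^*=1|X=x) = \mu_1(x) + (1-\pi(x))\,[P(Y^*=1|D=0,X=x) - \mu_1(x)]$, so the target lies between $\mu_1(x) + (1-\pi(x))\,\delta_{lo}(x)$ and $\mu_1(x) + (1-\pi(x))\,\delta_{hi}(x)$. The function name and inputs are hypothetical, and the nuisance quantities are taken as given rather than estimated.

```python
import numpy as np

def likelihood_bounds(mu1, pi, delta_lo, delta_hi):
    """Bounds on P(Y*=1 | X=x) under an assumed confounding bound.

    All arguments are hypothetical, illustrative inputs (scalars or arrays):
      mu1      -- P(Y*=1 | D=1, X=x), identified from selected units
      pi       -- P(D=1 | X=x), the selection probability
      delta_lo -- assumed lower bound on P(Y*=1|D=0,x) - P(Y*=1|D=1,x)
      delta_hi -- assumed upper bound on the same difference
    """
    mu1, pi = np.asarray(mu1, dtype=float), np.asarray(pi, dtype=float)
    delta_lo = np.asarray(delta_lo, dtype=float)
    delta_hi = np.asarray(delta_hi, dtype=float)

    # Law of total probability: mu* = mu1 + (1 - pi) * (mu0 - mu1).
    lower = mu1 + (1.0 - pi) * delta_lo
    upper = mu1 + (1.0 - pi) * delta_hi

    # The unselected-unit likelihood must lie in [0, 1], so intersect with
    # the worst-case bounds [pi * mu1, pi * mu1 + (1 - pi)].
    lower = np.maximum(lower, pi * mu1)
    upper = np.minimum(upper, pi * mu1 + (1.0 - pi))
    return lower, upper

# Illustrative numbers only (not taken from the paper's data).
lo, hi = likelihood_bounds(mu1=0.2, pi=0.6, delta_lo=-0.1, delta_hi=0.3)
```

Setting `delta_lo = delta_hi = 0` collapses the interval to `mu1`, which corresponds to assuming no unobserved confounding; widening the interval for the difference moves the bounds toward the worst-case range.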
