Predictive Performance Comparison of Decision Policies Under Confounding

Luke Guerdan; Amanda Coston; Kenneth Holstein; Zhiwei Steven Wu

TL;DR

This paper tackles the problem of comparing predictive decision policies to a status quo in the presence of unobserved confounding, focusing on pre-deployment evaluation rather than post-hoc assessment. It develops a partial-identification framework that localizes confounding-induced uncertainty to the policy disagreement region and introduces a novel $\delta$-regret interval to bound the difference in policy performance more tightly than traditional baselines. By linking a range of modern causal identification assumptions (e.g., instrumental variables, marginal sensitivity models, proximal variables) to pointwise bounding functions, the authors provide a flexible method to estimate finite-sample regret bounds via plug-in and doubly robust estimators with cross-fitting. The approach is validated on synthetic data under MSM and IV scenarios and demonstrated on a real healthcare enrollment setting, where it can yield more decisive pre-deployment conclusions than existing non-comparative OPE methods. Overall, the framework advances confounding-robust, pre-deployment policy evaluation by delivering informative, assumption-tunable regret bounds that focus on the most informative regions of the action space.

Abstract

Predictive models are often introduced to decision-making tasks under the rationale that they improve performance over an existing decision-making policy. However, it is challenging to compare predictive performance against an existing decision-making policy that is generally under-specified and dependent on unobservable factors. These sources of uncertainty are often addressed in practice by making strong assumptions about the data-generating mechanism. In this work, we propose a method to compare the predictive performance of decision policies under a variety of modern identification approaches from the causal inference and off-policy evaluation literatures (e.g., instrumental variable, marginal sensitivity model, proximal variable). Key to our method is the insight that there are regions of uncertainty that we can safely ignore in the policy comparison. We develop a practical approach for finite-sample estimation of regret intervals under no assumptions on the parametric form of the status quo policy. We verify our framework theoretically and via synthetic data experiments. We conclude with a real-world application using our framework to support a pre-deployment evaluation of a proposed modification to a healthcare enrollment policy.
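The insight that some regions of uncertainty can be ignored in the comparison can be illustrated with a minimal sketch. This is not the paper's $\delta$-regret interval or its estimators. It is a worst-case illustration under assumptions made here only for the example: binary decisions, a binary outcome observed only where the status quo policy acts (`a0 == 1`), and accuracy as the performance measure. The function name `accuracy_regret_intervals` and all variable names are hypothetical.

```python
import numpy as np

def accuracy_regret_intervals(a0, a1, y):
    """Worst-case bounds on accuracy regret of policy a1 versus status quo a0.

    Assumption (for illustration only): y is observed only where a0 == 1
    and is NaN elsewhere. Returns (baseline, localized) intervals.
    """
    a0 = np.asarray(a0)
    a1 = np.asarray(a1)
    y = np.asarray(y, dtype=float)
    n = len(a0)

    observed = a0 == 1
    y_obs = np.where(observed, y, 0.0)  # placeholder value where y is unknown

    # Fraction of all units on which each policy is known to be correct.
    k0 = (observed & (a0 == y_obs)).sum() / n
    k1 = (observed & (a1 == y_obs)).sum() / n

    # Baseline: bound each policy's accuracy separately, then subtract.
    # Every unobserved unit could be right or wrong for either policy.
    p_unobs = (~observed).sum() / n
    baseline = (k1 - (k0 + p_unobs), (k1 + p_unobs) - k0)

    # Localized: where the policies agree, the accuracy difference is zero
    # whatever y is, so only unobserved units in the disagreement region count.
    p_dis = ((~observed) & (a1 != a0)).sum() / n
    localized = (k1 - k0 - p_dis, k1 - k0 + p_dis)
    return baseline, localized

a0 = [1, 1, 0, 0, 0, 1]
a1 = [1, 0, 0, 1, 0, 1]
y = [1, 0, np.nan, np.nan, np.nan, 1]
baseline, localized = accuracy_regret_intervals(a0, a1, y)
print("baseline :", baseline)
print("localized:", localized)
```

In this toy example three units are unobserved but the policies disagree on only one of them, so the localized interval is narrower than the baseline one. Identification assumptions such as those listed in the abstract would restrict the unobserved outcomes further; that step is not shown here.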

Paper Structure (40 sections, 16 theorems, 81 equations, 14 figures, 1 algorithm)

Introduction
Related Work
Preliminaries
Problem Formulation
Partial Identification of Policy Performance
Regret Bound Identification
Mapping Causal Assumptions to Informative Regret Bounds
Regret Bound Estimation
Plug-in Estimator
Doubly Robust Estimator
Numerical Experiments
Real-World Application: Comparing Healthcare Enrollment Policies
Conclusion
Asymptotic Regret Bounds
$\delta$-regret bounds on utility regret.
...and 25 more sections

Key Result

Theorem 4.3

Let $\Delta(m, \mathcal{V}) = I(m, \mathcal{V}) - I_{\delta}(m, \mathcal{V})$. Then the $\delta$-regret interval improves on the baseline regret interval by $\Delta(m, \mathcal{V})$, which the theorem characterizes in terms of $\alpha = \overline{v}_{0}(0, 0) - \underline{v}_{0}(0, 0)$, $\psi_0(\pi) = p(A^{\pi} = 0)$, and $\overline{\gamma}_{y} = \sum_{a}\sum_{a'} \overline{v}_{y}(a,a')$.

Figures (14)

Figure 1: Illustration of uncertainty in comparing two policies in a toy setting with $X \in \mathbb{R}^2$. Points are labelled by their outcome: positive (+), negative (-), or unknown (?). Ovals denote the selection region of a policy. Points that neither policy selects (shown in grey) are irrelevant to the comparison. Our method leverages this to reduce policy comparison uncertainty.
Figure 2: Flow of assumptions in our framework. (A) Traditional causal assumptions imply pointwise bounding functions on the unobserved outcome (see the appendix); (B) Pointwise bounding functions imply constrained uncertainty sets (Lemma 5.2); (C) Constrained uncertainty sets imply policy regret bounds (see the Regret Bound Identification section).
Figure 3: Improvement in bounds offered by the $\delta$-regret interval over the baseline interval. We systematically vary the relative size of $v$-statistics and plot bounds as a function of interval improvement $\Delta(m) = I(m) - I_{\delta}(m)$ characterized by Theorem 4.3.
Figure 4: Comparison between $\delta$-regret and baseline regret interval end-points averaged over $N=20$ trials and $N_s=20{,}000$ samples. The first row uses an MSM identification assumption; the second uses an IV assumption.
Figure 5: Top: Coverage of accuracy regret interval estimates as a function of total sample size. Bottom: 95% bootstrap confidence intervals around upper and lower regret bounds over $N=25$ trials. Solid line indicates the oracle regret.
...and 9 more figures

Theorems & Definitions (33)

Example 1.1: Human vs algorithm decisions
Example 1.2: Human+algorithm vs algorithm decisions
Definition 4.1: Baseline regret interval
Definition 4.2: $\delta$-regret interval
Theorem 4.3: Regret separation
Lemma 5.2: Assumption mapping
Theorem 5.3: Minimality
Theorem 6.1
Lemma 1.1: $\delta_u$-regret bounds
proof
...and 23 more
