CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies

Brian M Cho, Ana-Roxana Pop, Kyra Gan, Sam Corbett-Davies, Israel Nir, Ariel Evnine, Nathan Kallus

TL;DR

The paper tackles safe policy improvement for threshold policies by introducing CSPI and CSPI-MT, which provide asymptotically calibrated tests that control the risk of adopting a worse policy at level $\gamma$ while enabling testing of multiple cutoff candidates. It builds a data-splitting framework with efficient influence-function-based estimators and extends single-cutoff testing to a robust multi-cutoff approach using joint confidence bands. The contributions include practical heuristics for cutoff selection that balance passing probability with expected improvement, plus extensive empirical evidence on synthetic and semi-synthetic data showing improved detection power and calibrated error control in low signal-to-noise scenarios. The work has direct implications for high-risk decision-making in economics, healthcare, and digital advertising, where threshold policies are common and safety guarantees are essential.

Abstract

When modifying existing policies in high-risk settings, it is often necessary to ensure with high certainty that the newly proposed policy improves upon a baseline, such as the status quo. In this work, we consider the problem of safe policy improvement, where one only adopts a new policy if it is deemed to be better than the specified baseline with at least pre-specified probability. We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising. Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements, so too often they must revert to the baseline to maintain safety. We overcome these issues by leveraging the most powerful safety test in the asymptotic regime and allowing for multiple candidates to be tested for improvement over the baseline. We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level, even in moderate sample sizes. We present CSPI and CSPI-MT, two novel heuristics for selecting cutoff(s) to maximize the policy improvement from baseline. We demonstrate through both synthetic and external datasets that our approaches improve both the detection rates of safe policies and the realized improvement, particularly under stringent safety requirements and low signal-to-noise conditions.
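The adoption rule described in the abstract (adopt a candidate only when it is judged better than the baseline at error level $\gamma$) can be illustrated with a minimal sketch of a one-sided test for a single cutoff. This is not the paper's implementation. It assumes that the per-unit treatment-effect scores (`effect_scores`, for example doubly robust pseudo-outcomes) are already computed on held-out data, that the policy treats units whose score is at or above the cutoff, and that a plain normal-approximation lower bound is used. The function and argument names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def threshold_policy(score, cutoff):
    """Treat a unit when its score is at or above the cutoff (assumed convention)."""
    return 1 if score >= cutoff else 0


def safety_test(scores, effect_scores, cutoff, baseline_cutoff, gamma):
    """Return (estimate, lower_bound, adopt) for one candidate cutoff.

    effect_scores are hypothetical per-unit treatment-effect pseudo-outcomes.
    The candidate is adopted only if the one-sided (1 - gamma) lower bound on
    the policy value difference versus the baseline is strictly positive.
    """
    n = len(scores)
    # A unit contributes only when the two policies disagree on it, i.e. when
    # its score lies between the candidate and baseline cutoffs.
    contrib = [
        (threshold_policy(s, cutoff) - threshold_policy(s, baseline_cutoff)) * g
        for s, g in zip(scores, effect_scores)
    ]
    estimate = mean(contrib)
    std_err = stdev(contrib) / math.sqrt(n)
    z = NormalDist().inv_cdf(1.0 - gamma)  # one-sided normal quantile
    lower_bound = estimate - z * std_err
    return estimate, lower_bound, lower_bound > 0.0


if __name__ == "__main__":
    # Toy data: integer scores 0..99, constant positive effect score.
    toy_scores = list(range(100))
    toy_effects = [1.0] * 100
    est, lb, adopt = safety_test(toy_scores, toy_effects, 30, 50, gamma=0.05)
    print(f"estimate={est:.3f} lower_bound={lb:.3f} adopt={adopt}")
```

When the test does not pass, the sketch simply reports `adopt=False`, which corresponds to reverting to the baseline policy.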

Paper Structure

This paper contains 28 sections, 2 theorems, 11 equations, 3 figures, and 5 algorithms.

Key Result

Proposition 1

Assume that (i) estimated functions $\hat{\mu}_k$, $\hat{e}_k$ are bounded with respect to $P$ almost surely, (ii) $\| \hat{\mu}_k - \mu\|_{P,2}\times \|\hat{e}_k - e \|_{P, 2} = o_P(1/\sqrt{n})$, and (iii) $\mathbb{P}_P(S \in [\min(c,c_0), \max(c,c_0)])$ is bounded above 0. Then, our confidence int Furthermore, our variance estimate $\widehat{\Sigma}(c) \rightarrow {\Sigma}(c)$ converges to the m

Figures (3)

  • Figure 1: Visual comparison of Algorithms \ref{alg:single_selection} and \ref{alg:multi_selection} on DGP2, defined in Section \ref{sec:empirics}. Left and Middle: selecting cutoff policies on $D_{tune}$. EPD: estimated policy value difference(s). Right: Testing selected policies on $D_{test}$. The estimated lower confidence bound for each cutoff is indicated by the corresponding horizontal bar.
  • Figure 2: Pass rate of selected cutoffs and expected improvement across $\gamma$ values. Top row: Pass rates from left to right for DGP1, DGP2, JOBS (with baseline: treat-none). Bottom row: Expected improvement compared to HCPI (t-test) for DGP1, DGP2, JOBS.
  • Figure 3: Error Rates for DGP3, with $n=2000$.

Theorems & Definitions (8)

  • Definition 1: Policy Value
  • Definition 2: Threshold Policy Class
  • Definition 3: Asymptotic Safety
  • Definition 4: Calibration Error
  • Definition 5: Expected Improvement
  • Proposition 1: Asymptotic Correctness and Efficiency of Our Safety Test
  • Proposition 2: Asymptotic Correctness of Sup-$t$ band
  • Remark 1: Computational Justification for Our Heuristic
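The joint (sup-$t$) band named in Proposition 2 can be illustrated with a small sketch of how a simultaneous critical value for several candidate cutoffs might be obtained. This is an illustration under stated assumptions, not the paper's procedure: it uses a Gaussian multiplier bootstrap on standardized per-unit contributions, takes the one-sided maximum of the simulated $t$-statistics, and reads off its $(1-\gamma)$ quantile. The function names, the `contribs` layout (one list of per-unit contributions per candidate), and the one-sided form are all assumptions made for the example.

```python
import math
import random
from statistics import mean, stdev


def sup_t_critical_value(contribs, gamma, n_draws=2000, seed=0):
    """Simulate the (1 - gamma) quantile of the maximum t-statistic.

    contribs[j][i] is the (hypothetical) contribution of unit i to the
    estimated value difference of candidate j.
    """
    rng = random.Random(seed)
    n = len(contribs[0])
    # Center and scale each candidate so its simulated statistic is roughly N(0, 1).
    standardized = []
    for col in contribs:
        m, sd = mean(col), stdev(col)
        standardized.append([(v - m) / sd for v in col])
    max_stats = []
    for _ in range(n_draws):
        # The same multipliers are shared across candidates, which preserves
        # the correlation between the candidates' estimates.
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        stats = [
            sum(x * v for x, v in zip(xi, col)) / math.sqrt(n)
            for col in standardized
        ]
        max_stats.append(max(stats))
    max_stats.sort()
    idx = min(n_draws - 1, math.ceil((1.0 - gamma) * n_draws) - 1)
    return max_stats[idx]


def simultaneous_lower_bounds(contribs, gamma, n_draws=2000, seed=0):
    """Lower bounds that hold jointly over all candidates (in this sketch)."""
    crit = sup_t_critical_value(contribs, gamma, n_draws, seed)
    n = len(contribs[0])
    return [mean(col) - crit * stdev(col) / math.sqrt(n) for col in contribs]


if __name__ == "__main__":
    # Toy contributions for three candidate cutoffs against a baseline at 50.
    effects = [1.0 + 0.5 * ((i % 3) - 1) for i in range(100)]
    toy = [
        [g if c <= i < 50 else 0.0 for i, g in enumerate(effects)]
        for c in (20, 30, 40)
    ]
    print(simultaneous_lower_bounds(toy, gamma=0.05))
```

Because the critical value comes from the maximum over candidates, it is at least as large as the single-candidate quantile for the same simulated draws; this is the price of testing several cutoffs at once while keeping the overall error level at $\gamma$.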