
Predicting Long Term Sequential Policy Value Using Softer Surrogates

Hyunji Nam, Allen Nie, Ge Gao, Vasilis Syrgkanis, Emma Brunskill

TL;DR

This work tackles the problem of predicting the long-term value of a sequential policy when that policy introduces new actions and full-horizon data for it are unavailable. It introduces soft surrogates and the dynamic invariance assumption, which together connect short-horizon on-policy data with long-horizon off-policy data and enable estimation of the evaluation policy's value $V^{\pi_e}$ from limited observations. The authors propose regression-based soft surrogate estimators and their doubly robust extensions, with finite-sample guarantees under covariate shift and cross-fitting. In simulated HIV treatment and sepsis ICU environments, the estimators make accurate predictions using as little as $10\%$ of the full horizon and outperform several baselines, which is valuable in high-stakes domains like healthcare. The work thus offers a principled way to evaluate novel sequential policies without lengthy trials, with potential applications in other domains where outcomes are slow to observe, such as education and online systems.
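Under a surrogacy-type assumption (stated here in its generic form, not the paper's soft-surrogate version), the long-term value can be written as $V^{\pi_e} = \mathbb{E}_{\tau_{1:h}\sim\pi_e}\big[\,\mathbb{E}[G \mid \tau_{1:h}]\,\big]$, where the inner expectation is learned from historical full-horizon data. The sketch below is a generic surrogate-index style plug-in estimator, not the authors' estimator. It assumes each first-$h$-step prefix has already been encoded as a fixed-length feature vector whose distribution is covered by the historical data, and it uses linear regression as a stand-in for whatever model one would actually choose. `fit_linear` and `surrogate_plugin_estimate` are hypothetical names.

```python
import numpy as np

def fit_linear(X, y, ridge=1e-6):
    """Least squares with an intercept column and a tiny ridge term for stability."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ y)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ w

def surrogate_plugin_estimate(short_hist, G_hist, short_new):
    """Plug-in surrogate estimate of a new policy's long-term value (hypothetical helper).

    short_hist: (n_b, d) features of the first h steps of historical trajectories
    G_hist:     (n_b,)   full-horizon returns of those historical trajectories
    short_new:  (n_e, d) features of the first h steps collected under the new policy
    """
    f_hat = fit_linear(short_hist, G_hist)   # f(tau) ~ E[G | first h steps], fit on historical data
    return float(np.mean(f_hat(short_new)))  # average predicted return over new-policy prefixes

# Synthetic check: G = 2*x1 + x2 + noise; the new policy shifts prefix features to mean (1, 1).
rng = np.random.default_rng(0)
X_hist = rng.normal(size=(5000, 2))
G_hist = 2 * X_hist[:, 0] + X_hist[:, 1] + 0.5 * rng.normal(size=5000)
X_new = rng.normal(loc=1.0, size=(5000, 2))
v_hat = surrogate_plugin_estimate(X_hist, G_hist, X_new)  # true value here is 3.0
```

Because `f_hat` is fit only on historical prefixes, its predictions on the new policy's prefixes depend on overlap between the two prefix distributions and on the regression extrapolating correctly. The doubly robust extensions add a density-ratio correction for this covariate shift.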

Abstract

Off-policy policy evaluation (OPE) estimates the outcome of a new policy using historical data collected from a different policy. However, existing OPE methods cannot handle cases where the new policy introduces novel actions. This issue commonly occurs in real-world domains such as healthcare, where new drugs and treatments are continuously developed. Novel actions necessitate on-policy data collection, which can be burdensome and expensive if the outcome of interest takes a substantial amount of time to observe--for example, in multi-year clinical trials. This raises a key question: how can we predict the long-term outcome of a policy after observing only its short-term effects? Although this problem is intractable in general, under some surrogacy conditions the short-term on-policy data can be combined with long-term historical data to make accurate predictions about the new policy's long-term value. In two simulated healthcare examples--HIV and sepsis management--we show that our estimators can provide accurate predictions of the policy value after observing only 10% of the full-horizon data. We also provide a finite-sample analysis of our doubly robust estimators.

Paper Structure

This paper contains 45 sections, 9 theorems, 50 equations, 1 figure, 8 tables, and 1 algorithm.

Key Result

Theorem 1

Assume that $|G|$, $|f^{(k)}(\tau)|$, and $|f(\tau)|$ are asymptotically bounded by $C_1 H$, and that $\hat{h}^{(k)}(\tau)$ and $h(\tau)$ are asymptotically bounded by $C_2$, where $C_1$ and $C_2$ are constants. Under the surrogacy and coverage assumptions, with probability at least $1-\delta$, the doubly robust estimator satisfies a variance-based error rate.
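To build intuition for the quantities in this bound, the sketch below implements a generic cross-fitted doubly robust estimator for covariate shift. It rests on our own reading of the symbols: $f$ is an outcome regression of the long-term return $G$ on a short trajectory $\tau$, $h$ is the density ratio between new-policy and historical short trajectories, and the superscript $(k)$ marks nuisances fit without fold $k$. The estimate is $\frac{1}{n_e}\sum_i \hat f^{(k(i))}(\tau^e_i) + \frac{1}{n_b}\sum_j \hat h^{(k(j))}(\tau^b_j)\,\big(G_j - \hat f^{(k(j))}(\tau^b_j)\big)$. This is an illustration under assumptions, not the authors' estimator. The linear outcome model, the logistic-regression density-ratio estimator, the symbols $n_e, n_b$, and all function names are hypothetical choices, and each prefix $\tau$ is assumed to be pre-encoded as a feature vector.

```python
import numpy as np

def _design(X):
    """Prepend an intercept column."""
    return np.column_stack([np.ones(len(X)), X])

def fit_linear(X, y, ridge=1e-6):
    """Outcome model f(tau) ~ E[G | tau] via (nearly unregularised) least squares."""
    Xb = _design(X)
    w = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ y)
    return lambda Z: _design(Z) @ w

def fit_density_ratio(X_new, X_hist, ridge=1e-6, iters=25):
    """Density ratio h(tau) = p_new(tau) / p_hist(tau) from a logistic classifier."""
    Xb = _design(np.vstack([X_new, X_hist]))
    y = np.concatenate([np.ones(len(X_new)), np.zeros(len(X_hist))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):  # Newton (IRLS) steps for logistic regression
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        grad = Xb.T @ (p - y) + ridge * w
        hess = (Xb * (p * (1 - p))[:, None]).T @ Xb + ridge * np.eye(Xb.shape[1])
        w -= np.linalg.solve(hess, grad)
    odds_to_ratio = len(X_hist) / len(X_new)  # undo the class-size imbalance
    return lambda Z: odds_to_ratio * np.exp(_design(Z) @ w)

def dr_crossfit_estimate(X_new, X_hist, G_hist, n_folds=2, seed=0):
    """Cross-fitted DR estimate: mean f(new prefixes) + mean h * (G - f) on historical data."""
    rng = np.random.default_rng(seed)
    fold_new = rng.permutation(len(X_new)) % n_folds    # balanced random folds
    fold_hist = rng.permutation(len(X_hist)) % n_folds
    direct = np.zeros(len(X_new))
    correction = np.zeros(len(X_hist))
    for k in range(n_folds):
        tr_new, tr_hist = fold_new != k, fold_hist != k
        f_k = fit_linear(X_hist[tr_hist], G_hist[tr_hist])   # nuisances fit without fold k
        h_k = fit_density_ratio(X_new[tr_new], X_hist[tr_hist])
        te_new, te_hist = ~tr_new, ~tr_hist                  # ... and evaluated on fold k
        direct[te_new] = f_k(X_new[te_new])
        resid = G_hist[te_hist] - f_k(X_hist[te_hist])
        correction[te_hist] = h_k(X_hist[te_hist]) * resid
    return float(direct.mean() + correction.mean())

# Synthetic check with a deliberately misspecified (linear) outcome model:
# G = x1**2 + x2 + noise, historical prefixes ~ N(0, I), new-policy prefixes ~ N(0.5, I).
rng = np.random.default_rng(1)
X_hist = rng.normal(size=(20000, 2))
G_hist = X_hist[:, 0] ** 2 + X_hist[:, 1] + 0.1 * rng.normal(size=20000)
X_new = rng.normal(loc=0.5, size=(20000, 2))
v_plugin = float(fit_linear(X_hist, G_hist)(X_new).mean())  # outcome model only
v_dr = dr_crossfit_estimate(X_new, X_hist, G_hist)           # true value: 1.25 + 0.5 = 1.75
```

In this synthetic check the linear outcome model cannot represent $x_1^2$, so `v_plugin` is biased (about 1.5 versus a true value of 1.75). The density-ratio correction in `v_dr` removes most of that bias because the ratio model happens to be correctly specified for a Gaussian mean shift. This illustrates the usual doubly robust property: an accurate estimate of either nuisance is enough for consistency.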

Figures (1)

  • Figure 1: We are interested in predicting a patient's long-term outcome after only a short observation window. The standard surrogacy assumption [athey2019] fails because of the red trajectory: later decisions lead it to a very different outcome than other trajectories that share the same initial $h$ observations. In this work, we leverage a narrower definition of surrogacy, similar to [battochi2021], and show that it enables effective policy evaluation.
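The failure mode in this caption can be reproduced in a few lines. The simulation below uses a hypothetical linear chain, not the paper's HIV or sepsis simulator. The historical and new policies behave identically for the first $h$ steps and differ only afterwards, so a regression of the return $G$ on the state reached at step $h$, fit on historical data, cannot tell them apart. `rollout` and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, H = 5000, 5, 50  # trajectories, observation window, full horizon

def rollout(later_action):
    """Hypothetical linear chain: s_{t+1} = 0.9 s_t + a_t + noise, reward r_t = s_t.
    Both policies draw a_t ~ N(0, 1) for t < h, then play `later_action` at every later step."""
    s = np.zeros(n)
    G = np.zeros(n)
    s_h = np.zeros(n)
    for t in range(H):
        if t == h:
            s_h = s.copy()  # state reached after the h-step observation window
        a = rng.normal(size=n) if t < h else np.full(n, later_action)
        G += s              # accumulate reward r_t = s_t
        s = 0.9 * s + a + 0.1 * rng.normal(size=n)
    return s_h, G

s_h_old, G_old = rollout(later_action=0.0)  # historical policy: full horizon observed
s_h_new, G_new = rollout(later_action=1.0)  # new policy: in practice only s_h_new is available

# Fit E[G | s_h] on historical data and apply it to the new policy's short-horizon data.
slope, intercept = np.polyfit(s_h_old, G_old, deg=1)
surrogate_prediction = float(np.mean(intercept + slope * s_h_new))
true_value = float(np.mean(G_new))
```

Here `surrogate_prediction` stays near the historical value of roughly zero, while `true_value` is in the hundreds. The decisions made after step $h$, not the first $h$ observations, drive the outcome.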

Theorems & Definitions (14)

  • Theorem 1: Variance-Based Rate for DR
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Theorem 3: Doubly Robust Bias Bound
  • Proof
  • Theorem 4: Variance-Based Rate for DR
  • Proof
  • Theorem 5
  • Proof
  • ...and 4 more