Short-Long Policy Evaluation with Novel Actions

Hyunji Alex Nam, Yash Chandak, Emma Brunskill

TL;DR

This work tackles the challenge of evaluating long-horizon outcomes for policies that introduce novel actions, using only short-horizon observations and historical off-policy data. It introduces two methods: SLEV, a covariate-shift regression approach, and SLED, a Markov-dynamics-based method with a low-dimensional adapter to capture new actions. The authors provide theoretical guarantees for SLEV and demonstrate substantial empirical gains across HIV treatment, kidney dialysis, and battery charging domains, including the ability to detect suboptimal policies early for safety. The work offers a practical framework for rapid long-horizon policy assessment in domains with long evaluation times, such as education, healthcare, and energy management, and outlines avenues for extending the approach with confidence intervals and action-embedding techniques.

Abstract

From incorporating LLMs in education, to identifying new drugs and improving ways to charge batteries, innovators constantly try new strategies in search of better long-term outcomes for students, patients and consumers. One major bottleneck in this innovation cycle is the amount of time it takes to observe the downstream effects of a decision policy that incorporates new interventions. The key question is whether we can quickly evaluate long-term outcomes of a new decision policy without making long-term observations. Organizations often have access to prior data about past decision policies and their outcomes, evaluated over the full horizon of interest. Motivated by this, we introduce a new setting for short-long policy evaluation for sequential decision making tasks. Our proposed methods significantly outperform prior results on simulators of HIV treatment, kidney dialysis and battery charging. We also demonstrate that our methods can be useful for applications in AI safety by quickly identifying when a new decision policy is likely to have substantially lower performance than past policies.

Paper Structure (47 sections, 3 theorems, 23 equations, 9 figures, 2 tables)

This paper contains 47 sections, 3 theorems, 23 equations, 9 figures, 2 tables.

Key Result

Theorem 4.1

For $\hat{f}^* \in \mathcal{F}$ (assuming a finite class of size $F = |\mathcal{F}|$) and a training dataset of size $n$, with probability at least $1-4\delta$, where $M := \max_{x} \{w(x), \hat{w}(x)\}$ and $\hat{\mathcal{R}}_{\text{train}}(\hat{w} \hat{f}^*) := \sum_{(x, y) \in \mathcal{D}^{\text{train}}} \hat{w} (x) (\hat{f}^*(x) - y)^2$.
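
The weighted empirical risk $\hat{\mathcal{R}}_{\text{train}}$ defined in the theorem can be illustrated with a minimal NumPy sketch. The toy data, the three candidate functions, and the helper names `weighted_empirical_risk` and `select_from_finite_class` are all hypothetical. Choosing $\hat{f}^*$ as the minimizer of the weighted risk over a finite class is an assumption made for illustration, not something stated in the text above.

```python
import numpy as np


def weighted_empirical_risk(w_hat, preds, y):
    """Sum over training pairs of w_hat(x) * (f(x) - y)^2."""
    w_hat = np.asarray(w_hat, dtype=float)
    preds = np.asarray(preds, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(w_hat * (preds - y) ** 2))


def select_from_finite_class(candidates, w_hat, x, y):
    """Assumed selection rule: pick the candidate with the smallest weighted risk."""
    risks = [weighted_empirical_risk(w_hat, f(x), y) for f in candidates]
    return int(np.argmin(risks)), risks


# Hypothetical toy data: inputs x, outcomes y, estimated density-ratio weights w_hat.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])
w_hat = np.array([0.5, 1.0, 1.5, 2.0])

# A finite function class with three candidates (F = 3).
candidates = [lambda v: v, lambda v: 2.0 * v, lambda v: 3.0 * v]

best_idx, risks = select_from_finite_class(candidates, w_hat, x, y)
print(best_idx, risks)
```

Here the weights play the role of $\hat{w}(x)$: training points that look more like the target distribution contribute more to the loss, which is the usual idea behind regression under covariate shift.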

Figures (9)

  • Figure 1: Short-long policy evaluation predicts the long-term outcome of a target policy with novel actions using only short-term observations and historical off-policy data $\mathcal{D}$.
  • Figure 2: Comparison of methods on the HIV and kidney dialysis simulators. A horizon length of 20 corresponds to 10% of the full horizon in the HIV treatment domain; a horizon length of 3 corresponds to 10% in the kidney dialysis domain. Shaded regions show the standard deviation of errors across 3 seeds.
  • Figure 3: Accuracy of different methods in identifying safe versus unsafe policies over varying horizon lengths. Shaded regions show the standard deviation across 3 seeds.
  • Figure 4: Two different battery cell curves, one with a lifetime of 719 cycles and the other with a lifetime of 487 cycles.
  • Figure 5: The same battery curves as in Fig. 6 become right-aligned after being shifted by their life cycles along the $x$-axis. Both curves can be explained by a global curve $f$.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem C.1
  • Theorem C.2