Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target

Alexander Guyer; Thomas G. Dietterich

TL;DR

This work addresses the need for calibrated, conditional probabilities that an autonomous system will achieve a user-specified behavior target in an MDP. It extends conformal prediction by moving the conformal correction into probability space, yielding Probability-space Conformalized Quantile Regression (PCQR). Inverting PCQR gives PCQR-1, a calibrated conditional CDF estimate used to predict coverage probabilities for target intervals. The key contributions are showing that CQR is not invertible, introducing probability-space conformity scores, obtaining an invertible PCQR with finite-sample marginal guarantees, and demonstrating well-calibrated coverage probabilities in the Starcraft 2 and Tamarisk domains. The approach lets a system raise an early alarm when the probability of meeting a target falls below a threshold, which matters when deploying autonomous systems under uncertainty.
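The alarm rule described above can be sketched in a few lines. This is an illustration only: `empirical_cdf` is a hypothetical stand-in for the calibrated conditional CDF (it is not PCQR-1), and the function names and the threshold value are assumptions. For a continuous response, the probability of landing in $[y^-, y^+]$ is the CDF difference $F(y^+) - F(y^-)$.

```python
import numpy as np

def empirical_cdf(samples):
    """Hypothetical stand-in for a calibrated CDF: fraction of samples <= y."""
    samples = np.asarray(samples, dtype=float)
    return lambda y: float(np.mean(samples <= y))

def interval_probability(cdf, y_lo, y_hi):
    """Estimate P(y_lo <= Y <= y_hi) as F(y_hi) - F(y_lo), clipped to [0, 1]."""
    return float(np.clip(cdf(y_hi) - cdf(y_lo), 0.0, 1.0))

def should_alarm(prob, threshold):
    """Raise an alarm when the estimated probability drops below the threshold."""
    return prob < threshold

# Usage: four illustrative final-reward samples and a target interval [1.5, 3.5].
cdf = empirical_cdf([1.0, 2.0, 3.0, 4.0])
prob = interval_probability(cdf, 1.5, 3.5)
alarm = should_alarm(prob, threshold=0.8)
```

In the setting of the paper, this check would be repeated at each time step $t$ with the CDF conditioned on the current state.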

Abstract

As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \in \{0, \ldots, H-1\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-, y^+]$. Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call Probability-space Conformalized Quantile Regression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain guarantees for the probability that the cumulative reward of an autonomous system will fall below a threshold sampled from the marginal distribution of the response variable (i.e., a calibrated CDF estimate). We use this estimate to predict coverage probabilities for user-specified target intervals. Experiments on two domains confirm that these probabilities are well-calibrated.
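For orientation, the CQR starting point mentioned above can be sketched as follows. This is a minimal illustration of the standard split-conformal correction of Romano et al., not the PCQR method of this paper. The function names are hypothetical, and the quantile predictions are assumed to come from an already fitted quantile regressor.

```python
import math
import numpy as np

def cqr_correction(q_lo_cal, q_hi_cal, y_cal, alpha):
    """Split-conformal correction computed on a held-out calibration set.

    The conformity score is max(q_lo - y, y - q_hi): positive when y falls
    outside the predicted interval, negative when it falls inside.
    """
    q_lo_cal = np.asarray(q_lo_cal, dtype=float)
    q_hi_cal = np.asarray(q_hi_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(scores)
    # Rank of the finite-sample-adjusted (1 - alpha) empirical quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])

def cqr_interval(q_lo, q_hi, correction):
    """Widen (or shrink) the predicted interval by the same amount on both sides."""
    return q_lo - correction, q_hi + correction

# Usage with three illustrative calibration points and alpha = 0.5.
corr = cqr_correction([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 1.25, 1.5], alpha=0.5)
lo, hi = cqr_interval(2.0, 3.0, corr)
```

The correction here is applied in the units of the response variable. PCQR, as described in the abstract, instead moves the correction into probability space by using the conditional CDF as the non-conformity measure.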

Paper Structure (11 sections, 3 theorems, 26 equations, 4 figures)

Introduction
Related work
Contributions
Method
CQR is not Invertible
Introducing Probability-Space Conformity Scores
PCQR is Invertible
Experiments
Conclusion
Future work
Acknowledgments

Key Result

Theorem 1

Given an exchangeable sequence $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n+1}, y_{n+1})$ and almost surely unique conformity scores $S(\mathbf{x}_1, y_1), \ldots, S(\mathbf{x}_{n+1}, y_{n+1})$, the following inequalities hold:

Figures (4)

Figure 1: Mean Expected Calibration Error (ECE) across all $k$ data partitions vs. time for the PCQR-1 probability lower and upper bound predictions in the Starcraft 2 and Tamarisk domains. ECE is measured with 30 equal-width bins. Standard deviations are depicted by the semi-transparent regions.
Figure 2: Reliability diagrams for the PCQR-1 probability lower bound predictions in the Starcraft 2 and Tamarisk domains. Data from all data partitions and time steps are binned into 30 equal-width bins according to the predicted coverage probabilities. The gray histograms give the bin counts. The x-axis depicts the mean predicted coverage probability in each bin. The y-axis shows the observed coverage rate in each bin. The blue line corresponds to the experimental results, and the black dashed line reflects perfect calibration.
Figure 3: PCQR-1 predicted coverage probability lower bounds vs. time for 10 random episodes in the Starcraft 2 and Tamarisk domains. Each line corresponds to a separate episode.
Figure 4: Mean PCQR-1 predictions vs. time in the Starcraft 2 and Tamarisk domains. Means are computed across all episodes of all data partitions. Standard deviations are computed from the per-data-partition means across all episodes and are depicted by the semi-transparent regions.
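The calibration metric in Figures 1 and 2 can be sketched as below. This uses the common definition of ECE with equal-width bins: the bin-weighted mean absolute gap between mean predicted probability and observed rate. The paper's exact computation may differ, and the function name and inputs are assumptions.

```python
import numpy as np

def expected_calibration_error(pred, outcome, n_bins=30):
    """ECE over equal-width bins on [0, 1].

    pred:    predicted coverage probabilities
    outcome: 1 if the final value fell in the target interval, else 0
    """
    pred = np.asarray(pred, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each prediction to a bin index in 0..n_bins-1.
    idx = np.digitize(pred, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(pred[mask].mean() - outcome[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of points in the bin
    return float(ece)

# Usage: two bins, each with a 0.25 gap between prediction and observed rate.
ece = expected_calibration_error([0.25, 0.25, 0.75, 0.75], [0, 1, 1, 1], n_bins=2)
```

The per-bin pairs of mean prediction and observed rate are the same quantities plotted on the x- and y-axes of a reliability diagram.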

Theorems & Definitions (7)

Theorem 1
proof
Definition 1
Lemma 1
proof
Theorem 2
proof
