Conformal Off-Policy Evaluation in Markov Decision Processes

Daniele Foffano, Alessio Russo, Alexandre Proutiere

TL;DR

The paper tackles offline off-policy evaluation in finite-horizon MDPs by building conformal prediction (CP) intervals that contain the target policy's value with a user-specified confidence. It develops a weighted CP framework to compensate for the distribution shift between the behavior and target policies, and introduces asymmetric score-based refinements (double-quantile and shifted-values) to center predictions on the target policy. It provides practical offline estimation strategies for the likelihood ratios, including Monte-Carlo, empirical, and gradient approaches, and demonstrates in an inventory control setup that the proposed CP methods yield shorter, well-calibrated intervals than standard baselines. The study highlights the distribution-free, non-asymptotic guarantees of CP in OPE and points to future work on more scalable likelihood-ratio estimation and broader empirical validation.

Abstract

Reinforcement Learning aims at identifying and evaluating efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (this is the case when experimenting is expensive, risky or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift due to the discrepancies between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals with reduced length compared to existing approaches, while maintaining the same certainty level.

Paper Structure (30 sections, 5 theorems, 38 equations, 4 figures, 1 algorithm)

Key Result

Proposition 1

Under Assumption 1, for any score function $s$ and any $\alpha \in (0,1)$, the conformalized set $\hat{C}_n$ satisfies

$$\mathbb{P}^{\pi^b,\pi}\left[Y \in \hat{C}_n(X)\right] \geq 1-\alpha,$$

where $\mathbb{P}^{\pi^b,\pi}$ accounts for the randomness of $(X, Y)\sim P_{X, Y}^{\pi}$ and that of the data $\mathcal{D}_{cal} = \{X_i,Y_i\}_{i=1}^n$ (with $(X_i,Y_i)\sim P_{X,Y}^{\pi^b}$ for all $i\in [n]$).
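A coverage statement of this kind rests on a quantile of the re-weighted score distribution (the $\hat{F}_n^{x,y}$ mentioned in the Figure 1 caption). The sketch below follows the standard weighted split-conformal construction rather than the paper's exact implementation: each calibration score gets mass proportional to its likelihood-ratio weight, and the test point's mass is placed at $+\infty$. The function name and arguments are hypothetical, and the weights are assumed to be given (in practice they would be the estimates $\hat{w}$).

```python
import numpy as np

def weighted_conformal_quantile(cal_scores, cal_weights, test_weight, alpha):
    """(1 - alpha)-quantile of the weighted score distribution.

    Assumption: calibration score i carries mass w_i / (sum_j w_j + w_test),
    and the remaining mass sits at +infinity for the unseen test score.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_weights = np.asarray(cal_weights, dtype=float)
    total = cal_weights.sum() + test_weight
    probs = cal_weights / total

    # Accumulate mass over the scores in increasing order.
    order = np.argsort(cal_scores)
    cum_mass = np.cumsum(probs[order])

    # First sorted position whose cumulative mass reaches 1 - alpha.
    idx = np.searchsorted(cum_mass, 1.0 - alpha, side="left")
    if idx >= len(cal_scores):
        # Calibration mass alone never reaches 1 - alpha: the set is unbounded.
        return np.inf
    return cal_scores[order][idx]
```

A candidate value $y$ is then kept in the set when $s(x,y)$ does not exceed this quantile. Since the weight of the test point may itself depend on $(x,y)$, the quantile may have to be recomputed for each candidate $y$.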

Figures (4)

  • Figure 1: Conformal prediction for off-policy evaluation. The dataset $\mathcal{D}$ is collected using a behavior policy $\pi^b$ and then split into the training set $\mathcal{D}_{tr}$ and the calibration set $\mathcal{D}_{cal}$. When evaluating a different policy $\pi$, there is a shift in the data distribution, and we need to learn likelihood ratios $\hat{w}$ to compensate for this shift. The training data is used to learn estimates of the weights $\hat{w}$ and a model $\hat{f}$ used in the computation of the scores. The estimated weights are used as plug-in estimates to re-weight the cumulative distribution function of the scores $\hat{F}_n^{x,y}$, which is then used to compute the conformalized intervals $\hat{C}_n(x)$.
  • Figure 2: Symmetry problem. For the original confidence set with a single quantile and score function $s(x,y)=\max(q_{\alpha_{\text{lo}}}(x) - y, y - q_{\alpha_{\text{hi}}}(x))$, we obtain a set that is symmetric around its midpoint $(q_{\alpha_{\text{lo}}}(x) + q_{\alpha_{\text{hi}}}(x))/2$. We can break this symmetry by considering two different score quantiles, one for $q_{\alpha_{\text{lo}}}(x) - y$ and one for $y - q_{\alpha_{\text{hi}}}(x)$, which leads to a less conservative conformalized set.
  • Figure 3: An example of the difference $M-m$ for the case of a convex mixture, with $|\mathcal{A}|=10$, $H=40$ and $\epsilon^b=0.4$.
  • Figure 4: Results for the inventory control problem for $H=20,40$, with target coverage of $90\%$. The policy $\pi^b$ is $\epsilon^b$-greedy w.r.t. $\pi^\star$ (an optimal discounted policy with discount factor $\gamma=0.99$), with $\epsilon^b = 0.4$. We evaluated a target policy $\pi$ that is $\epsilon$-greedy w.r.t. $\pi^\star$, with varying $\epsilon$. The four plots on the left are the results corresponding to the first instance of the Inventory Problem, while on the right we present the results for the second instance (both described in the paper's Environment section). The boxplots show average conformalized intervals for the various methods (whiskers indicate $95\%$ confidence intervals for the minimum and the maximum). The line plots depict the obtained coverage level (bars indicate $95\%$ confidence intervals).
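The symmetry issue in the Figure 2 caption can be illustrated with a small sketch. It is unweighted for simplicity (the paper's methods re-weight the scores to handle distribution shift), and it assumes the miscoverage $\alpha$ is split evenly as $\alpha/2$ between the two one-sided scores, which is one common choice and not necessarily the authors'. All function names are hypothetical.

```python
import numpy as np

def conformal_quantile(scores, level):
    """Order statistic with the finite-sample (n + 1) correction."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * level))  # 1-based rank of the order statistic
    return np.inf if k > n else scores[k - 1]

def symmetric_interval(q_lo, q_hi, y, q_lo_test, q_hi_test, alpha):
    """Single score s = max(q_lo - y, y - q_hi): same margin on both sides."""
    scores = np.maximum(q_lo - y, y - q_hi)
    t = conformal_quantile(scores, 1 - alpha)
    return q_lo_test - t, q_hi_test + t

def double_quantile_interval(q_lo, q_hi, y, q_lo_test, q_hi_test, alpha):
    """Two one-sided scores, each calibrated at level 1 - alpha / 2 (assumed split)."""
    t_lo = conformal_quantile(q_lo - y, 1 - alpha / 2)
    t_hi = conformal_quantile(y - q_hi, 1 - alpha / 2)
    return q_lo_test - t_lo, q_hi_test + t_hi
```

When the calibration errors are one-sided (for instance, outcomes that only ever exceed the upper quantile estimate), the single-score version widens both ends by the same margin, while the two-score version can widen only the side that needs it.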

Theorems & Definitions (10)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • proof