Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation

Feichen Gan, Youcun Lu, Yingying Zhang, Yukun Liu

TL;DR

This work proposes a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling to mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts.

Abstract

Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts. We propose a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and importance weight estimation. Empirical results, including experiments in synthetic and benchmark environments like Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
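
To make the two ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a $k$-step pseudo-return is the discounted sum of the first $k$ observed rewards plus a discounted sample of the remaining return drawn from a fitted distributional model, and that calibration uses a standard weighted conformal quantile with a unit weight reserved for the test point. The function names, the argument `tail_sample`, and the unit test-point weight are all hypothetical choices made for illustration.

```python
import numpy as np


def k_step_pseudo_return(rewards, tail_sample, gamma, k):
    """Truncated-rollout pseudo-return (assumed form, for illustration only).

    rewards     : observed rewards r_t, r_{t+1}, ... along one trajectory
    tail_sample : one draw from a fitted return model at the state reached
                  after the truncation point (hypothetical input)
    gamma       : discount factor in (0, 1)
    k           : number of observed steps to keep
    """
    rewards = np.asarray(rewards, dtype=float)[:k]
    discounts = gamma ** np.arange(len(rewards))
    # Observed part plus the discounted model-based tail.
    return float(discounts @ rewards + gamma ** len(rewards) * tail_sample)


def weighted_conformal_quantile(scores, weights, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    A point mass for the unseen test point sits at +infinity; its weight is
    set to 1 here, which is an assumption of this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    # Normalized cumulative weight, including the reserved test-point mass.
    cum = np.cumsum(weights) / (weights.sum() + 1.0)
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    # If the calibration mass never reaches 1 - alpha, the quantile is infinite.
    return float(scores[idx]) if idx < len(scores) else float("inf")
```

In a split-conformal workflow of this kind, the scores would typically be nonconformity values computed from pseudo-returns on a calibration set, the weights would be (estimated) importance ratios between target and behavior policies, and the resulting quantile would widen or shift the model's interval. With uniform weights the second function reduces to the usual split-conformal quantile.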

Paper Structure

This paper contains 41 sections, 78 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Pipeline of the proposed conformal policy prediction framework.
  • Figure 2: Coverage probability and average interval length at the 90% level for the proposed method with $k$-step pseudo-returns ($k = 1,\ldots,5$, from left to right) and DRL-QR (rightmost), under on-policy and off-policy settings in Example 1 (columns 1-2) and Example 2 (columns 3-4).
  • Figure 3: Coverage probability and average interval length at the 90% level for the proposed method with $\xi=0.5,0.6$ and $k=2,3$ (from left to right) and Foffano's method (rightmost).
  • Figure 4: Coverage probability and average interval length at the 90% level for the proposed method with $k$-step pseudo-returns ($k = 1,\ldots,5$, from left to right) and KD-QR (rightmost), under on-policy (left) and off-policy (right) settings in Example 3.
  • Figure 5: Coverage probability and average interval length at the 90% level for the proposed method with $k$-step pseudo-returns ($k = 1,\ldots,5$, from left to right) and DRL-QR (rightmost), under on-policy (left) and off-policy (right) settings in Example 4.

Theorems & Definitions (3)

  • proof
  • proof : Proof of Theorem 1
  • proof : Proof of Theorem 2