Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

Cameron Voloshin, Hoang M. Le, Nan Jiang, Yisong Yue

TL;DR

Off-policy policy evaluation (OPE) in reinforcement learning is sensitive to data distribution shift and horizon length. The paper introduces COBS, a modular, reproducible benchmarking suite that stress-tests inverse propensity scoring (IPS), direct method (DM), and hybrid method (HM) estimators across eight diverse environments and varying data-generating conditions. It provides a practical method-selection guideline and shows that no single estimator dominates; performance depends on horizon, policy divergence, stochasticity, and representation. These insights support informed method selection for safety-critical RL deployments, and the open-source COBS platform is intended to accelerate future OPE research.

Abstract

We offer an experimental benchmark and empirical study for off-policy policy evaluation (OPE) in reinforcement learning, which is a key problem in many safety-critical applications. Given the increasing interest in deploying learning-based methods, there has been a flurry of recent proposals for OPE methods, leading to a need for standardized empirical analyses. Our work takes a strong focus on diversity of experimental design to enable stress testing of OPE methods. We provide a comprehensive benchmarking suite to study the interplay of different attributes on method performance. We distill the results into a summarized set of guidelines for OPE in practice. Our software package, the Caltech OPE Benchmarking Suite (COBS), is open-sourced and we invite interested researchers to further contribute to the benchmark.

Paper Structure

This paper contains 48 sections, 16 equations, 43 figures, 25 tables.

Figures (43)

  • Figure 1: One of the dimensions over which COBS provides control. For the Mountain Car environment, we can select either a tabular, standard coordinate-based, or pixel-based representation of the state while holding other factors fixed.
  • Figure 2: General Guideline Decision Tree.
  • Figure 3: Left: (Graph domain) Comparing IPS (and IH) under short and long horizon. Mild policy mismatch setting. PDWIS is often best among IPS. But IH outperforms in long horizon. Center: (Pixel-MC) Comparing direct methods in high-dimensional, long horizon setting. Relatively large policy mismatch. FQE and IH tend to outperform. AM is significantly worse in complex domains. Retrace($\lambda$), Q($\lambda$) and Tree-Backup($\lambda$) are very computationally expensive and thus excluded. Right: (Top Methods) The top 5 methods which perform the best across all conditions and domains.
  • Figure 4: Comparing IPS versus Direct methods versus Hybrid methods under short and long horizon, large policy mismatch and large data. Left: (Graph domain) Deterministic environment. Center: (Graph domain) Stochastic environment and rewards. Right: (Graph-POMDP) Model misspecification (POMDP). Minimum error per class is shown.
  • Figure 5: (Gridworld domain) Errors are directly correlated with policy mismatch but inversely correlated with data size. We pick the best direct methods for illustration. The two plots represent the same figure from two different vantage points.
  • ...and 38 more figures
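
For readers unfamiliar with the IPS family named in Figure 3, the following is a minimal sketch of per-decision weighted importance sampling (PDWIS) in its textbook form. It is not taken from COBS, whose implementation may differ. The function name `pdwis`, the `Step` tuple layout, and the equal-length-trajectory restriction are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

# Hypothetical data layout (an assumption, not the COBS format): one logged step is
# (behavior_prob, target_prob, reward), where the two probabilities are those of
# the logged action under the behavior policy and the target policy.
Step = Tuple[float, float, float]


def pdwis(trajectories: Sequence[Sequence[Step]], gamma: float = 1.0) -> float:
    """Textbook per-decision weighted importance sampling estimate.

    Assumes every trajectory has the same length and that every logged action
    has nonzero probability under the behavior policy.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    horizon = len(trajectories[0])
    if any(len(traj) != horizon for traj in trajectories):
        raise ValueError("this sketch assumes equal-length trajectories")

    # cum_ratio[i] holds the cumulative importance ratio of trajectory i up to step t.
    cum_ratio = [1.0] * len(trajectories)
    estimate = 0.0
    for t in range(horizon):
        for i, traj in enumerate(trajectories):
            behavior_prob, target_prob, _ = traj[t]
            cum_ratio[i] *= target_prob / behavior_prob
        weight_sum = sum(cum_ratio)
        if weight_sum == 0.0:
            # No logged trajectory is still possible under the target policy.
            break
        # Self-normalized (weighted) average of the step-t rewards.
        step_value = sum(
            w * traj[t][2] for w, traj in zip(cum_ratio, trajectories)
        ) / weight_sum
        estimate += (gamma ** t) * step_value
    return estimate
```

When the target and behavior policies agree on every logged action, all ratios equal 1 and the estimate reduces to the average discounted return of the logged data. As the policies diverge or the horizon grows, the cumulative ratios become more uneven, which is one way to see why IPS-style estimators are sensitive to policy mismatch and horizon length.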