Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

Cameron Voloshin; Hoang M. Le; Nan Jiang; Yisong Yue

TL;DR

Off-policy policy evaluation (OPE) in reinforcement learning is sensitive to data distribution shift and horizon effects. The paper introduces COBS, a modular, reproducible benchmarking suite that stress-tests inverse propensity scoring (IPS), direct method (DM), and hybrid method (HM) estimators across eight diverse environments and varying data-generating conditions. It provides a practical method-selection guideline and shows that no single estimator dominates; performance depends on horizon, policy divergence, stochasticity, and representation. These insights support informed method choice in safety-critical RL deployments, and the open-source COBS platform is intended to accelerate future OPE research.

Abstract

We offer an experimental benchmark and empirical study for off-policy policy evaluation (OPE) in reinforcement learning, which is a key problem in many safety-critical applications. Given the increasing interest in deploying learning-based methods, there has been a flurry of recent proposals for OPE methods, leading to a need for standardized empirical analyses. Our work takes a strong focus on diversity of experimental design to enable stress testing of OPE methods. We provide a comprehensive benchmarking suite to study the interplay of different attributes on method performance. We distill the results into a summarized set of guidelines for OPE in practice. Our software package, the Caltech OPE Benchmarking Suite (COBS), is open-sourced and we invite interested researchers to further contribute to the benchmark.
