SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation

Haruka Kiyohara; Ren Kishimoto; Kosuke Kawakami; Ken Kobayashi; Kazuhide Nakata; Yuta Saito

TL;DR

SCOPE-RL addresses the practical need for an end-to-end platform that combines offline RL and off-policy evaluation (OPE), with particular emphasis on its OPE modules. It integrates a comprehensive suite of OPE estimators, including cumulative distribution OPE (CD-OPE), and implements risk-aware evaluation-of-OPE metrics to inform policy deployment decisions. The library supports end-to-end workflows from data collection to offline policy learning, OPE, and off-policy selection (OPS). It is compatible with Gym/Gymnasium and d3rlpy, and ships with visualization tools and documentation for researchers and practitioners. Together, these features enable more reliable policy evaluation and safer, more efficient downstream policy selection, and provide a flexible benchmarking environment for offline RL and OPE research. The authors also outline future directions, including extensions of CD-OPE, support for partially observable settings, and automated estimator selection.

Abstract

This paper introduces SCOPE-RL, a comprehensive open-source Python library designed for offline reinforcement learning (offline RL), off-policy evaluation (OPE), and off-policy selection (OPS). Unlike most existing libraries, which focus solely on either policy learning or evaluation, SCOPE-RL seamlessly integrates these two key aspects, facilitating flexible and complete implementations of both offline RL and OPE processes. SCOPE-RL puts particular emphasis on its OPE modules, offering a range of OPE estimators and robust evaluation-of-OPE protocols. This approach enables more in-depth and reliable OPE than other packages. For instance, SCOPE-RL enhances OPE by estimating the entire reward distribution under a policy rather than only its point-wise expected value. Additionally, SCOPE-RL provides a more thorough evaluation-of-OPE by presenting the risk-return tradeoff in OPE results, extending beyond the accuracy-only evaluations in the existing OPE literature. SCOPE-RL is designed with user accessibility in mind. Its user-friendly APIs, comprehensive documentation, and variety of easy-to-follow examples help researchers and practitioners efficiently implement and experiment with various offline RL methods and OPE estimators, tailored to their specific problem contexts. The documentation of SCOPE-RL is available at https://scope-rl.readthedocs.io/en/latest/.
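To make the idea of estimating the whole reward distribution concrete, the following is a minimal NumPy sketch of a trajectory-wise importance sampling estimate of the return CDF. It is not SCOPE-RL's API: the function names `trajectory_weights` and `tis_cdf`, the array layout, and the toy numbers are assumptions chosen for illustration, and the estimator form (an importance-weighted average of indicators) is the standard one we assume from the CD-OPE literature.

```python
import numpy as np


def trajectory_weights(pi_e_probs, pi_b_probs):
    """Hypothetical helper: per-trajectory importance weights.

    Both inputs have shape (n_trajectories, horizon) and hold the probability
    that the evaluation / behavior policy assigns to each logged action.
    """
    pi_e_probs = np.asarray(pi_e_probs, dtype=float)
    pi_b_probs = np.asarray(pi_b_probs, dtype=float)
    # Product of step-wise probability ratios along each trajectory.
    return np.prod(pi_e_probs / pi_b_probs, axis=1)


def tis_cdf(returns, weights, thresholds):
    """Hypothetical helper: importance-weighted estimate of P(return <= m)."""
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    # indicator[i, j] is True when trajectory i's return is <= thresholds[j].
    indicator = returns[:, None] <= thresholds[None, :]
    cdf = (weights[:, None] * indicator).mean(axis=0)
    # Importance weights can push the raw estimate above 1, so clip it.
    return np.clip(cdf, 0.0, 1.0)


# Toy logged data (made up): 4 trajectories, horizon 2.
pi_b = np.full((4, 2), 0.5)
pi_e = np.array([[0.5, 0.5], [0.75, 0.5], [0.25, 0.5], [0.5, 0.5]])
returns = np.array([1.0, 2.0, 3.0, 4.0])

weights = trajectory_weights(pi_e, pi_b)
cdf = tis_cdf(returns, weights, thresholds=[0.0, 2.0, 4.0])
print(weights, cdf)
```

Once a CDF estimate is available, risk-sensitive summaries such as quantiles or tail averages can be read off it, which is what distinguishes this style of OPE from a single expected-value estimate.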

Paper Structure (42 sections, 23 equations, 16 figures, 2 tables)

Introduction
Overview of SCOPE-RL
Implemented OPE estimators and evaluation-of-OPE metrics
Key Feature 1: Cumulative distribution OPE
Preliminaries.
Off-Policy Evaluation (OPE).
Cumulative Distribution Off-Policy Evaluation (CD-OPE).
Key Feature 2: Comprehensive evaluation-of-OPE metrics
Background.
Evaluation of OPE.
User-friendly APIs, visualization tools, and documentation
Summary and Future Work
Details of implemented OPE estimators and assessment metrics
Standard Off-Policy Evaluation
Direct Method (DM)
...and 27 more sections

Figures (16)

Figure 1: End-to-end workflow of offline RL and OPE with SCOPE-RL.
Figure 2: Summary of the distinctive features of SCOPE-RL. OPE: While existing packages (e.g., fu2021benchmarks, voloshin2019empirical) focus only on estimating the expected performance in a point-wise manner (left), SCOPE-RL additionally supports cumulative distribution OPE (chandak2021universal, huang2021off, huang2022off) to estimate the whole distribution of policy performance (right). Evaluation-of-OPE: While existing packages report only the "accuracy" of OPE or that of the downstream policy selection tasks (kiyohara2023towards) (left), SCOPE-RL also measures various risk-return tradeoff metrics in top-$k$ policy selection (right) (see Section \ref{sec:assessments} for details). Visualization: Finally, all figures, including those illustrating the properties of existing packages, are generated by the visualization tools implemented in SCOPE-RL.
Figure 3: (Top) Example code of estimating the CDF with CD-OPE estimators implemented in SCOPE-RL. (Bottom) The output of the visualization function of the CD-OPE module.
Figure 4: The practical workflow of policy evaluation and selection uses OPE as a screening process: an OPE estimator ($\hat{J}$) chooses the top-$k$ (shortlisted) candidate policies to be tested in online A/B tests, where $k$ is a pre-defined online evaluation budget. The policy identified as the best in the online evaluation process is chosen as the production policy ($\hat{\pi}^*$). (Figure and description credit: kiyohara2023towards)
Figure 5: (Top) Example code to perform evaluation-of-OPE with SharpeRatio@k and other statistics of the top-$k$ policy portfolio using SCOPE-RL. (Bottom) Visualizing the evaluation-of-OPE results.
...and 11 more figures
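The top-$k$ screening workflow described in the captions of Figures 4 and 5 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not SCOPE-RL's API: the function `top_k_metrics`, its arguments, and the toy values are hypothetical, and the SharpeRatio@k formula used here, (best@k minus the behavior policy's value) divided by std@k, is the definition we assume from kiyohara2023towards, with the population standard deviation as a further assumption.

```python
import numpy as np


def top_k_metrics(estimated_values, true_values, behavior_value, k):
    """Hypothetical helper: risk-return statistics of a top-k policy portfolio.

    estimated_values: OPE estimates for each candidate policy.
    true_values: ground-truth values of the same policies (e.g., from online tests).
    behavior_value: value of the behavior policy, used as the baseline.
    """
    estimated_values = np.asarray(estimated_values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    # Shortlist the k policies that the OPE estimator ranks highest.
    top_k = np.argsort(-estimated_values)[:k]
    selected = true_values[top_k]
    best = float(selected.max())   # return: best policy in the shortlist
    worst = float(selected.min())  # risk: worst policy that would be tested
    std = float(selected.std())    # risk: spread of the shortlist (population std)
    # Assumed definition: excess return over the baseline per unit of risk.
    sharpe = (best - behavior_value) / std if std > 0 else float("nan")
    return {
        "top_k": top_k.tolist(),
        "best@k": best,
        "worst@k": worst,
        "std@k": std,
        "sharpe_ratio@k": sharpe,
    }


# Toy example (made-up numbers): four candidate policies, budget k = 2.
result = top_k_metrics(
    estimated_values=[0.9, 0.2, 0.7, 0.4],
    true_values=[0.5, 0.3, 0.8, 0.6],
    behavior_value=0.4,
    k=2,
)
print(result)
```

The point of reporting best@k together with worst@k and std@k is that two estimators with similar ranking accuracy can shortlist portfolios of very different riskiness, which accuracy-only metrics do not reveal.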
