When is Offline Policy Selection Sample Efficient for Reinforcement Learning?
Vincent Liu, Prabhat Nagarajan, Andrew Patterson, Martha White
TL;DR
This work investigates when offline policy selection (OPS) can be sample-efficient by relating OPS to off-policy evaluation (OPE) and Bellman error (BE) estimation. It proves that OPS inherits the worst-case hardness of OPE, implying exponential sample complexity without structural assumptions, and outlines conditions under which BE-based selection can be more efficient. The authors introduce Identifiable BE Selection (IBES), a BE-based OPS algorithm that uses cross-validation to pick hyperparameters, and show, through experiments on classic control tasks and Atari data, that IBES often outperforms other BE-based or OPE-centric approaches under favorable data coverage. The findings emphasize the necessity of data-coverage and model-class assumptions to enable practical, sample-efficient OPS in offline RL. Overall, the paper provides a theoretical foundation for OPS limits and a practical BE-based method with empirical validation on challenging benchmarks.
Abstract
Offline reinforcement learning algorithms often require careful hyperparameter tuning. Before deployment, we need to select amongst a set of candidate policies. However, there is limited understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we provide clarity on when sample-efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result: in the worst case, OPS is just as hard as OPE, which we prove by reducing OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then connect BE estimation to the OPS problem, showing how BE can be used as a tool for OPS. While BE-based methods generally impose stronger requirements than OPE, when those conditions are met they can be more sample efficient. Building on this insight, we propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.
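To make the idea of BE-based selection concrete, the following is a minimal sketch and not the IBES algorithm itself. It assumes each candidate comes with a tabular action-value function and ranks candidates by their mean squared temporal-difference (TD) error on logged transitions, choosing the smallest. The squared TD error is used here only as a simple stand-in: it is a biased estimate of the Bellman error when transitions are stochastic, so the toy data below is deterministic. All names (`mean_squared_td_error`, `select_by_td_error`, `DATA`, `CANDIDATES`) and the two-state environment are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

# One logged transition: (state, action, reward, next_state, done).
Transition = Tuple[int, int, float, int, bool]
# A tabular action-value function indexed as q[state][action].
QTable = List[List[float]]


def mean_squared_td_error(q: QTable, data: List[Transition], gamma: float) -> float:
    """Average squared TD error of q over the logged transitions."""
    total = 0.0
    for s, a, r, s_next, done in data:
        # Greedy bootstrap target; no bootstrapping past a terminal transition.
        bootstrap = 0.0 if done else max(q[s_next])
        td_error = r + gamma * bootstrap - q[s][a]
        total += td_error * td_error
    return total / len(data)


def select_by_td_error(
    candidates: Dict[str, QTable], data: List[Transition], gamma: float
) -> str:
    """Return the name of the candidate with the smallest mean squared TD error."""
    scores = {
        name: mean_squared_td_error(q, data, gamma) for name, q in candidates.items()
    }
    return min(scores, key=scores.get)


# Hypothetical deterministic two-state task (illustration only):
# state 0: action 0 stays in state 0, action 1 moves to state 1, both with reward 0;
# state 1: action 0 ends the episode with reward 1, action 1 ends it with reward 0.
GAMMA = 0.5
DATA: List[Transition] = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 0, 1.0, 1, True),
    (1, 1, 0.0, 1, True),
]

CANDIDATES: Dict[str, QTable] = {
    # Satisfies the Bellman optimality equation on every logged transition.
    "q_good": [[0.25, 0.5], [1.0, 0.0]],
    # All-zero values: wrong on the rewarding terminal transition.
    "q_bad": [[0.0, 0.0], [0.0, 0.0]],
}

best = select_by_td_error(CANDIDATES, DATA, GAMMA)
```

Here `best` is `"q_good"`, since its TD error is zero on every logged transition while `q_bad` misses the rewarding one. The sketch needs no importance weights or environment model, which is one reason BE-style criteria are attractive, but it is only meaningful when the data covers the relevant state-action pairs.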
