When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

Vincent Liu; Prabhat Nagarajan; Andrew Patterson; Martha White

TL;DR

This work investigates when offline policy selection (OPS) can be sample-efficient by relating OPS to off-policy evaluation (OPE) and Bellman error (BE) estimation. It proves that OPS inherits the worst-case hardness of OPE, implying exponential sample complexity without structural assumptions, and outlines conditions under which BE-based selection can be more efficient. The authors introduce Identifiable BE Selection (IBES), a BE-based OPS algorithm that uses cross-validation to pick hyperparameters, and show, through experiments on classic control tasks and Atari data, that IBES often outperforms other BE-based or OPE-centric approaches under favorable data coverage. The findings emphasize the necessity of data-coverage and model-class assumptions to enable practical, sample-efficient OPS in offline RL. Overall, the paper provides a theoretical foundation for OPS limits and a practical BE-based method with empirical validation on challenging benchmarks.

Abstract

Offline reinforcement learning algorithms often require careful hyperparameter tuning. Before deployment, we need to select amongst a set of candidate policies. However, there is limited understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result: in the worst case, OPS is just as hard as OPE, which we prove by reducing OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then connect BE estimation to the OPS problem, showing how BE can be used as a tool for OPS. While BE-based methods generally require stronger assumptions than OPE, when those assumptions hold they can be more sample efficient. Building on this insight, we propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.
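To make the link between OPS and OPE concrete, the following is a minimal sketch of using an OPE estimator as a subroutine for policy selection: estimate each candidate's value and return the candidate with the highest estimate. The function names (`select_policy_via_ope`, `top1_regret`) and the toy estimator are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, Hashable, Sequence


def select_policy_via_ope(
    candidates: Sequence[Hashable],
    ope_estimate: Callable[[Hashable], float],
) -> Hashable:
    """Return the candidate whose estimated value is largest.

    `ope_estimate` stands in for any OPE method; it is a hypothetical
    callable mapping a policy identifier to an estimated value.
    """
    return max(candidates, key=ope_estimate)


def top1_regret(true_values: Dict[Hashable, float], selected: Hashable) -> float:
    """Gap between the best candidate's true value and the selected one's."""
    return max(true_values.values()) - true_values[selected]


# Toy example with made-up values: each estimate is off by at most 0.04.
true_values = {"pi_a": 0.2, "pi_b": 0.9, "pi_c": 0.5}
errors = {"pi_a": 0.04, "pi_b": -0.04, "pi_c": 0.04}


def noisy_estimate(name: str) -> float:
    # Hypothetical estimator: true value plus a bounded error.
    return true_values[name] + errors[name]


chosen = select_policy_via_ope(list(true_values), noisy_estimate)
print(chosen, top1_regret(true_values, chosen))
```

If every estimate is within $\varepsilon/2$ of the true value, the policy chosen this way has regret at most $\varepsilon$, which is why an accurate OPE method suffices for OPS.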

Paper Structure (35 sections, 8 theorems, 8 equations, 7 figures, 1 table, 2 algorithms)

Introduction
Background
On the Sample Complexity of OPS
OPE as Subroutine for OPS
OPS is not Easier than OPE
Implication: We Need Assumptions for Sample Efficient OPS
Bellman Error Selection for OPS
BE Selection Problem
A Sound BE Selection Algorithm with Cross-Validation
Comparison to Existing Methods
Comparison between BE and OPE for OPS
Experiments
Comparison between BE-based Methods
Comparison between FQE and IBES under Different Data Coverage
Comparison between FQE and IBES on Atari Datasets
...and 20 more sections

Key Result

Theorem 1

Given an MDP $M$, a data distribution $d_b$, and a set of policies $\Pi$, suppose that, for any pair $(\varepsilon,\delta)$, there exists an $(\varepsilon,\delta)$-sound OPE algorithm ${\mathcal{L}}$ on any OPE instance $I\in\{(M,d_b,\pi):\pi\in\Pi\}$ with a sample size at most $O(\mathrm{N}_{OPE}(S

Figures (7)

Figure 1: Correlation between true performance and estimated performance on Atari datasets. We generate 90 policies offline using the CQL algorithm with different choices for two hyperparameters: the number of training steps and the conservative parameter, and evaluate these policies using FQE on 5 different datasets. Each point in the scatter plot corresponds to a (policy, evaluation dataset) pair. The x-axis is the actual performance of that policy and the y-axis is its estimated performance using that evaluation dataset. Colors represent different evaluation datasets. The Kendall rank correlation coefficient is shown in the title of the plot. If the FQE estimates ranked policies accurately, we would expect to see a strong linear relationship and a Kendall rank correlation coefficient close to $1$. Neither behavior is seen here, so FQE does not provide an effective mechanism for ranking policies.
Figure 2: The offline RL pipeline with OPS. In the policy training phase, $n$ algorithm-hyperparameter pairs are trained on an offline dataset to produce $n$ candidate policies. An OPS algorithm then takes these $n$ policies as input and, again using offline data (potentially a validation dataset), selects a final policy.
Figure 3: Visual depiction of the reduction of OPE to OPS. Given an MDP $M$ and a target policy $\pi$, we can construct a new MDP $M'$ and two candidate policies $\{\pi_1,\pi_2\}$ for OPS, as shown in (a). The MDP construction was first mentioned in wang2020statistical. $\pi_1$ chooses $a_1$ in $s_0$, which leads to a terminal state $s_1$, and can select actions arbitrarily in other states. $\pi_2$ chooses $a_2$ in $s_0$ and is otherwise identical to the target policy $\pi$. Panel (b) describes the search procedure that finds the policy value by calling the OPS subroutine. When the OPS query returns $\pi_1$, we follow the green arrow. When the OPS query returns $\pi_2$, we follow the blue arrow. We keep searching for the true policy value by adjusting $r$ for each OPS query until the desired precision is reached.
Figure 4: Comparison between BE methods. The figure shows the normalized top-$1$ regret with varying sample size, averaged over 10 runs with one standard error. IBES consistently achieves the lowest regret across environments.
Figure 5: Comparison to BE with a fixed number of hidden units. The figure shows the normalized top-$1$ regret averaged over 10 runs with one standard error. IBES with model selection consistently achieves the lowest regret across environments.
...and 2 more figures
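
The search procedure in the Figure 3 caption can be sketched as a bisection over $r$. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes that action $a_1$ yields reward $r$ (so $\pi_1$ has value $r$), that $\pi_2$ has the same value as the target policy $\pi$, that values lie in $[0, 1]$, and that the OPS oracle is exact. The names `estimate_value_via_ops` and `make_exact_oracle` are hypothetical.

```python
from typing import Callable


def estimate_value_via_ops(
    ops_query: Callable[[float], str],
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-3,
) -> float:
    """Bisection on r using an OPS oracle that returns "pi_1" or "pi_2".

    Assumption: ops_query(r) builds the modified MDP with reward r for
    action a_1 and returns the name of the better candidate policy.
    """
    while hi - lo > tol:
        r = (lo + hi) / 2.0
        if ops_query(r) == "pi_1":
            hi = r  # pi_1 preferred: the target value is at most r
        else:
            lo = r  # pi_2 preferred: the target value exceeds r
    return (lo + hi) / 2.0


def make_exact_oracle(target_value: float) -> Callable[[float], str]:
    """Idealized oracle for illustration; real OPS only sees offline data."""
    def oracle(r: float) -> str:
        return "pi_1" if r >= target_value else "pi_2"
    return oracle


estimate = estimate_value_via_ops(make_exact_oracle(0.37))
print(estimate)
```

The number of OPS queries grows only logarithmically in $1/\text{tol}$, which is the intuition for why OPS cannot be much easier than OPE. An $(\varepsilon,\delta)$-sound oracle may answer either way when $r$ is within $\varepsilon$ of the true value, so the recovered value would be accurate only up to roughly $\varepsilon$; the sketch ignores this.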

Theorems & Definitions (11)

Definition 1: $(\varepsilon,\delta)$-sound OPS algorithm
Definition 2: $(\varepsilon,\delta)$-sound OPE algorithm
Theorem 1: Upper bound on sample complexity of OPS
Theorem 2: Lower bound on sample complexity of OPS
Corollary 1: Lower bound on the sample complexity of OPS
Definition 3: $(\varepsilon,\delta)$-sound BE selection
Theorem 1: Upper bound on sample complexity of OPS
Theorem 2: Lower bound on sample complexity of OPS
Corollary 1: Lower bound on the sample complexity of OPS
Lemma A.2: Lemma 3.2 of duan2021risk
...and 1 more
