Offline RL Without Off-Policy Evaluation

David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna

TL;DR

This work questions the central reliance on off-policy evaluation in offline RL by showing that a one-step policy improvement using the behavior policy's on-policy Q estimate $\widehat{Q}^{\beta}$ often surpasses iterative actor-critic methods on the D4RL benchmark. It introduces a unified offline approximate modified policy iteration (OAMPI) framework and analyzes several policy evaluation and improvement operators, with a focus on a simple yet robust one-step baseline. The key findings reveal that distribution shift and iterative error exploitation inherent to off-policy evaluation largely undermine iterative methods, while a conservative one-step update remains stable and competitive, though multiple steps can be advantageous under favorable data coverage. The results have practical significance for offline RL, suggesting that practitioners should establish strong one-step baselines before investing in more complex iterative algorithms, and pointing to future work on theoretical guarantees and data-coverage regimes.

Abstract

Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.
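The one-step recipe described in the abstract can be sketched in a tabular setting. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `evaluate_behavior_q` and `one_step_improvement`, the SARSA-style evaluation loop, the learning rate, and the toy data are all hypothetical. The improvement step uses the closed-form maximizer of a reverse KL regularized objective, $\pi(a|s) \propto \beta(a|s)\exp(\widehat{Q}^{\beta}(s,a)/\alpha)$, which is one of several possible constrained or regularized improvement operators.

```python
import numpy as np

def evaluate_behavior_q(transitions, n_states, n_actions, gamma=0.99, lr=0.1, epochs=200):
    """On-policy (SARSA-style) estimate of Q^beta from logged data.

    Each transition is (s, a, r, s_next, a_next, done), where a_next is the
    action the behavior policy actually took, so no off-policy action is queried.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, a_next, done in transitions:
            target = r + (0.0 if done else gamma * q[s_next, a_next])
            q[s, a] += lr * (target - q[s, a])
    return q

def one_step_improvement(q_beta, beta, alpha=0.1):
    """Single reverse-KL-regularized improvement step: pi ∝ beta * exp(Q / alpha).

    Assumes every row of beta has at least one action with nonzero probability.
    Actions with beta(a|s) = 0 keep probability zero under the new policy.
    """
    with np.errstate(divide="ignore"):
        log_beta = np.log(beta)
    logits = log_beta + q_beta / alpha
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

# Hypothetical one-state problem: action 0 yields reward 1, action 1 yields reward 0.
transitions = [(0, 0, 1.0, 0, 0, True), (0, 1, 0.0, 0, 1, True)]
beta = np.array([[0.5, 0.5]])
q_beta = evaluate_behavior_q(transitions, n_states=1, n_actions=2)
pi = one_step_improvement(q_beta, beta, alpha=0.1)
```

The point of the sketch is that the Q function is fit once, using only state-action pairs from the data, and the policy is updated once against it. An iterative method would instead re-estimate the Q function for each new policy, which requires off-policy evaluation.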

Paper Structure

This paper contains 51 sections, 2 theorems, 8 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

For any two policies $\pi$ and $\beta$,

Figures (11)

  • Figure 1: A cartoon illustration of the difference between one-step and multi-step methods. All algorithms constrain themselves to a neighborhood of "safe" policies around $\beta$. A one-step approach (left) only uses the on-policy $\widehat{Q}^\beta$, while a multi-step approach (right) repeatedly uses off-policy $\widehat{Q}^{\pi_i}$.
  • Figure 2: Learning curves and final performance on halfcheetah-medium across different algorithms and regularization hyperparameters (all using the reverse KL regularized improvement operator). Error bars show min and max over 3 seeds. Similar figures for other datasets from D4RL can be found in the appendix.
  • Figure 3: Results of running the iterative algorithm on halfcheetah-medium. Each checkpointed policy is evaluated by a Q function trained from scratch on held-out data. MSE refers to $\mathop{\mathbb{E}}_{s,a\sim \beta}[(\hat{Q}^{\pi_i}(s,a) - Q^{\pi_i}(s,a))^2]$ and KL refers to $\mathop{\mathbb{E}}_{s\sim \beta}[KL(\pi_i(\cdot|s)\| \beta(\cdot|s))]$. Left: 90 policies taken from various points in training with various hyperparameters and random seeds. Center: MSE learning curves. Right: KL learning curves. Error bars show min and max over 3 random seeds.
  • Figure 4: An illustration of multi-step offline regularized policy iteration. The leftmost panel in each row shows the true reward (top) or error $\varepsilon_\beta$ (bottom). Then each subsequent panel plots $\pi_i$ (with arrow size proportional to $\pi_i(a|s)$) over either $Q^{\pi_i}$ (top) or $\widetilde{Q}^{\pi}_\beta$ (bottom), averaged over actions at each state. The one-step policy ($\pi_1$) has the highest value. The behavior policy here is a mixture of optimal $\pi^*$ and uniform $u$ with coefficient 0.2 so that $\beta = 0.2 \cdot \pi^* + 0.8 \cdot u$. We set $\alpha = 0.1$ as the regularization parameter for reverse KL regularization.
  • Figure 5: Histograms of overestimation error ($\widehat{Q}^{\pi_i}(s,a) - Q^{\pi_i}(s,a)$) on halfcheetah-medium with the iterative algorithm. Left: errors from the training Q function. Right: errors from an independently trained Q function.
  • ...and 6 more figures
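
The quantities named in the captions of Figures 3 and 4 can be written out for a small tabular problem. The sketch below is an illustration under assumptions, not code from the paper: the helper names, the uniform state weights standing in for the data distribution, and the toy arrays are hypothetical. It builds the mixture behavior policy $\beta = 0.2 \cdot \pi^* + 0.8 \cdot u$ from Figure 4 and computes the KL and MSE diagnostics defined in Figure 3.

```python
import numpy as np

def mixture_behavior(pi_star, coef=0.2):
    """beta = coef * pi_star + (1 - coef) * uniform, as in the Figure 4 caption."""
    pi_star = np.asarray(pi_star, dtype=float)
    uniform = np.full_like(pi_star, 1.0 / pi_star.shape[1])
    return coef * pi_star + (1.0 - coef) * uniform

def expected_kl(pi, beta, state_weights):
    """E_s[KL(pi(.|s) || beta(.|s))], assuming beta > 0 everywhere."""
    ratio = np.where(pi > 0, pi / beta, 1.0)  # terms with pi = 0 contribute zero
    per_state = np.sum(pi * np.log(ratio), axis=1)
    return float(state_weights @ per_state)

def weighted_q_mse(q_hat, q_true, beta, state_weights):
    """E_{s, a ~ beta}[(q_hat - q_true)^2] with states drawn from state_weights."""
    weights = state_weights[:, None] * beta
    return float(np.sum(weights * (q_hat - q_true) ** 2))

# Hypothetical problem: 3 states, 4 actions, optimal policy always picks action 0.
pi_star = np.eye(4)[[0, 0, 0]]
beta = mixture_behavior(pi_star, coef=0.2)
state_weights = np.full(3, 1.0 / 3.0)  # assumed uniform state distribution

kl = expected_kl(pi_star, beta, state_weights)
mse = weighted_q_mse(np.full((3, 4), 0.5), np.zeros((3, 4)), beta, state_weights)
```

In the paper's experiments these expectations are taken over held-out data rather than an explicit state distribution, so the weights here are only a stand-in.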

Theorems & Definitions (2)

  • Lemma 1: Performance difference (Kakade and Langford, 2002)
  • Lemma 2: Conservative Policy Improvement (Achiam et al., 2017)