Offline RL Without Off-Policy Evaluation
David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna
TL;DR
This work questions offline RL's central reliance on off-policy evaluation. It shows that a single step of constrained policy improvement against $\widehat{Q}^{\beta}$, an on-policy Q estimate of the behavior policy, beats previously reported results of iterative actor-critic methods on a large portion of the D4RL benchmark. The paper introduces a unified offline approximate modified policy iteration (OAMPI) framework, analyzes several policy evaluation and improvement operators within it, and focuses on a simple, robust one-step baseline. The key finding is that distribution shift and iterative error exploitation, both inherent to off-policy evaluation, largely undermine iterative methods, while a conservative one-step update remains stable and competitive; multiple steps can still be advantageous under favorable data coverage. In practice, this suggests establishing a strong one-step baseline before investing in more complex iterative algorithms. The results also point to future work on theoretical guarantees and data-coverage regimes.
Abstract
Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.
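To make the one-step recipe concrete, here is a minimal tabular sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the transition format `(s, a, r, s_next, a_next, done)`, and the hyperparameters are all hypothetical, and the improvement step uses the closed-form KL-regularized update $\pi(a|s) \propto \hat{\beta}(a|s)\exp(\widehat{Q}^{\beta}(s,a)/\tau)$ as one possible choice of constrained/regularized operator.

```python
import numpy as np

def estimate_q_beta(transitions, n_states, n_actions, gamma=0.9, lr=0.1, sweeps=200):
    """SARSA-style on-policy evaluation of the behavior policy from logged data.

    Assumed format: each transition is (s, a, r, s_next, a_next, done), where
    a_next is the action the behavior policy actually took in s_next. No max
    over actions and no learned policy appears in the target, so this step
    involves no off-policy evaluation.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, a_next, done in transitions:
            target = r + (0.0 if done else gamma * q[s_next, a_next])
            q[s, a] += lr * (target - q[s, a])
    return q

def estimate_behavior(transitions, n_states, n_actions, smoothing=1e-3):
    """Smoothed empirical estimate of the behavior policy beta(a|s)."""
    counts = np.full((n_states, n_actions), smoothing)
    for s, a, *_ in transitions:
        counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def one_step_improve(q_beta, beta_hat, temperature=1.0):
    """Single regularized improvement step: pi(a|s) ∝ beta_hat(a|s) * exp(Q/temperature).

    A small temperature approaches the greedy policy on the data support;
    a large temperature stays close to the behavior policy.
    """
    logits = np.log(beta_hat) + q_beta / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)
```

The policy returned by `one_step_improve` is the final output; the sketch never re-evaluates it, which is what separates the one-step approach from an iterative actor-critic loop that would alternate evaluation and improvement.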
