Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Ding Bo, Huaimin Wang

TL;DR

This paper tackles offline model-based reinforcement learning by challenging the prevailing pessimistic optimization paradigm. It introduces ORPO, which decouples the training of an optimistic rollout policy in an Optimistic MDP (O-MDP) from pessimistic policy optimization in a Pessimistic MDP (P-MDP): optimistic rollouts are relabeled with penalized rewards before the final policy is learned. The authors provide theoretical guarantees under linear MDP assumptions and demonstrate substantial empirical gains on the D4RL benchmark, notably improving generalization in tasks requiring OOD exploration. The approach offers a principled way to leverage dynamics-model generalization while maintaining stability through pessimistic updates, achieving state-of-the-art performance on several datasets.

Abstract

Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
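The two-MDP idea in the abstract can be illustrated with a minimal sketch. It assumes the uncertainty is estimated as ensemble disagreement and that the O-MDP and P-MDP differ only in the sign of an uncertainty term added to the reward. The function names, the coefficients `lam_o` and `lam_p`, and the choice of uncertainty estimator are hypothetical and not taken from the paper.

```python
import numpy as np

def ensemble_uncertainty(next_state_preds):
    """Assumed uncertainty estimator: largest deviation from the ensemble mean.

    next_state_preds has shape (n_models, batch, state_dim);
    the result has shape (batch,).
    """
    mean = next_state_preds.mean(axis=0)
    # Per-model Euclidean distance to the ensemble mean, shape (n_models, batch).
    dev = np.linalg.norm(next_state_preds - mean, axis=-1)
    return dev.max(axis=0)

def optimistic_reward(reward, uncertainty, lam_o=1.0):
    """O-MDP reward (assumed form): uncertainty acts as a bonus."""
    return reward + lam_o * uncertainty

def pessimistic_reward(reward, uncertainty, lam_p=1.0):
    """P-MDP reward (assumed form): uncertainty acts as a penalty."""
    return reward - lam_p * uncertainty

# Toy batch: two ensemble members, two transitions, 2-D states.
preds = np.array([
    [[0.0, 0.0], [1.0, 1.0]],   # model 0
    [[0.0, 0.0], [1.0, 3.0]],   # model 1
])
rewards = np.array([1.0, 2.0])

u = ensemble_uncertainty(preds)
# The rollout policy would be trained on r_opt; the sampled transitions
# are then relabeled with r_pes before the output policy is optimized.
r_opt = optimistic_reward(rewards, u, lam_o=0.5)
r_pes = pessimistic_reward(rewards, u, lam_p=2.0)
```

In this sketch the same transitions carry two reward labels: the optimistic one steers where rollouts are collected, and the pessimistic one is what the output policy would actually be trained on.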

Paper Structure

This paper contains 41 sections, 7 theorems, 39 equations, 9 figures, 7 tables, 3 algorithms.

Key Result

Proposition 1

Under the assumption of linear MDPs, the uncertainty of dynamics models can form a UCB bonus.
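For orientation, a UCB-style bonus in a linear MDP is commonly written in the form below. The symbols ($\phi$, $\Lambda$, $\beta$, $\lambda$, $\lambda_o$, $\lambda_p$, $u$, $N$) are assumptions chosen for illustration and may not match the paper's notation:

$$
\Lambda = \lambda I + \sum_{i=1}^{N} \phi(s_i, a_i)\,\phi(s_i, a_i)^\top,
\qquad
u(s, a) = \beta \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)}
$$

$$
r_{\text{O-MDP}}(s, a) = r(s, a) + \lambda_o\, u(s, a),
\qquad
r_{\text{P-MDP}}(s, a) = r(s, a) - \lambda_p\, u(s, a)
$$

Under this reading, $u(s, a)$ grows for feature directions that the $N$ offline transitions cover poorly, so adding it rewards OOD rollouts while subtracting it penalizes them.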

Figures (9)

  • Figure 1: (a) Previous model-based offline RL generates model rollouts and optimizes the policy within the P-MDP. (b) We decouple the training of optimistic rollout policies from the pessimistic policy optimization.
  • Figure 2: (a) In toy experiments with a 2-dimensional continuous state space and action space, the coordinate origin (0, 0) is taken as the central point of the square region. The agent starts in the region between the lines $y=-x-0.25$ and $y=-x+0.25$, and the goal is to move toward the upper right to obtain high rewards. The offline dataset only contains transitions whose states lie in the initial area. (b) The further a state is from the offline dataset, the higher the uncertainty estimated by the dynamics model. (c) The policy trained with MOPO, using only the P-MDP, cannot reach regions with high reward but high uncertainty. (d) With more optimistic model rollouts but optimization in the same P-MDP, ORPO agents can learn to reach states with high rewards and avoid regions with low rewards. Please refer to the toy-experiment appendix for the detailed experimental setup.
  • Figure 3: A case study of methods using the model MDP (MBPO), P-MDP (MOPO), and O-MDP (OROO) on the "Halfcheetah-medium-v2" dataset over 5 different seeds. Left: Histograms of distances between actions from different model rollouts and the offline dataset. Right: Learning curves of different methods.
  • Figure 4: Learning curves of rollout policies and corresponding output policies in two datasets over 5 different seeds.
  • Figure 5: Learning curves of OROO, MOPO, and ORPO over 5 different seeds on "Halfcheetah-medium-expert-v2". We report the expected discounted returns in the model MDP $\eta_{\widehat{M}}( \pi)$ and the real MDP $\eta_{M}( \pi)$ as well as the average model error $\epsilon_u( \pi)$.
  • ...and 4 more figures

Theorems & Definitions (13)

  • Proposition 1
  • Theorem 1
  • Proposition 2
  • proof
  • Definition 1: $\xi$-Uncertainty Quantifier
  • Lemma 1: pevi-2021
  • Proposition 3
  • proof
  • Lemma 2
  • proof
  • ...and 3 more