Optimistic Model Rollouts for Pessimistic Offline Policy Optimization
Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Ding Bo, Huaimin Wang
TL;DR
This paper addresses offline model-based reinforcement learning and challenges the prevailing paradigm of purely pessimistic optimization. It introduces ORPO, which decouples the training of an optimistic rollout policy in an Optimistic MDP (O-MDP) from pessimistic policy optimization in a Pessimistic MDP (P-MDP): optimistic rollouts are relabeled with penalized rewards before the final policy is learned. The authors provide theoretical guarantees under linear MDP assumptions and report substantial empirical gains on the D4RL benchmark, with state-of-the-art performance on several datasets and the clearest improvements on tasks that require generalization to OOD regions. The approach exploits the generalization ability of learned dynamics models while keeping model errors in check through pessimistic updates.
Abstract
Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism that come from encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
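The relabeling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on assumptions the abstract does not state: uncertainty is taken to be the disagreement of a dynamics-model ensemble, the O-MDP reward is assumed to add an uncertainty bonus, and the P-MDP reward is assumed to subtract an uncertainty penalty. The names Transition, ensemble_uncertainty, bonus_coef, and penalty_coef are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Transition:
    state: tuple
    action: tuple
    reward: float        # reward predicted by the learned model
    uncertainty: float   # uncertainty estimate u(s, a) for this pair


def ensemble_uncertainty(next_state_preds):
    """Assumed uncertainty measure: the largest L2 distance between any
    ensemble member's next-state prediction and the ensemble mean."""
    n_models = len(next_state_preds)
    dim = len(next_state_preds[0])
    mean = [sum(p[i] for p in next_state_preds) / n_models for i in range(dim)]
    return max(math.dist(p, mean) for p in next_state_preds)


def optimistic_reward(t, bonus_coef):
    """Assumed O-MDP reward: uncertainty is rewarded, so the rollout
    policy is drawn toward OOD state-action pairs."""
    return t.reward + bonus_coef * t.uncertainty


def pessimistic_reward(t, penalty_coef):
    """Assumed P-MDP reward: uncertainty is penalized."""
    return t.reward - penalty_coef * t.uncertainty


def relabel(rollouts, penalty_coef):
    """Relabel optimistic rollouts with penalized rewards before they are
    used to optimize the output policy in the P-MDP."""
    return [
        Transition(t.state, t.action, pessimistic_reward(t, penalty_coef), t.uncertainty)
        for t in rollouts
    ]
```

In this sketch the rollout policy would be trained on optimistic_reward, while the output policy only ever sees transitions returned by relabel, so the optimism affects which state-action pairs are sampled but not the rewards used for the final policy update.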
