COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang

TL;DR

COPlanner addresses imperfect dynamics models in model-based RL with a planning-driven framework that combines conservative multi-step model rollouts with optimistic environment exploration, both driven by an uncertainty-aware planning module. It uses an ensemble-based uncertainty estimate in two ways: as a penalty on risky rollout trajectories and as a bonus for informative real-environment exploration, so the policy learns from reliable model data while the model is expanded where it is weakest. The framework is plug-and-play and improves both sample efficiency and asymptotic performance when paired with strong baselines (MBPO and DreamerV3) on proprioceptive and visual control tasks. Future work includes reducing planning overhead via parallelization and exploring alternative uncertainty estimators.

Abstract

Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration using the current policy for dynamics model learning. However, because real-world environments are complex, the learned dynamics model is inevitably imperfect, and its prediction error can mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods that addresses the problem of inaccurately learned dynamics models with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. When choosing actions, this estimated uncertainty serves as a penalty during model rollouts and as a bonus during real-environment exploration. Consequently, $\texttt{COPlanner}$ avoids model-uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model-uncertain regions through optimistic real-environment exploration, actively reducing model error. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any Dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods are significantly improved when combined with $\texttt{COPlanner}$.
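
The action-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `up_mpc_action`, its parameters, the use of ensemble variance as the uncertainty measure, and the use of the ensemble mean as the next state are all assumptions made for illustration. The only part taken from the abstract is that the same multi-step uncertainty estimate is subtracted during model rollouts and added during environment exploration.

```python
import numpy as np


def up_mpc_action(state, policy, ensemble, reward_fn, horizon=5,
                  n_candidates=5, alpha=1.0, mode="rollout", rng=None):
    """Return the first action of the best-scoring candidate trajectory.

    mode="rollout": uncertainty is a penalty (conservative).
    mode="explore": uncertainty is a bonus (optimistic).
    All names and defaults here are hypothetical.
    """
    if mode not in ("rollout", "explore"):
        raise ValueError("mode must be 'rollout' or 'explore'")
    sign = -1.0 if mode == "rollout" else 1.0
    rng = np.random.default_rng() if rng is None else rng

    # Start every candidate trajectory from the same state: shape (K, D).
    states = np.repeat(state[None, :], n_candidates, axis=0)
    scores = np.zeros(n_candidates)
    first_actions = None

    for t in range(horizon):
        actions = policy(states, rng)              # (K, A), policy-guided
        if t == 0:
            first_actions = actions
        # Every ensemble member predicts the next state: (E, K, D).
        preds = np.stack([model(states, actions) for model in ensemble])
        # Assumed uncertainty measure: ensemble disagreement (variance).
        uncertainty = preds.var(axis=0).mean(axis=-1)   # (K,)
        scores += reward_fn(states, actions) + sign * alpha * uncertainty
        # Assumed transition: continue from the ensemble mean prediction.
        states = preds.mean(axis=0)

    return first_actions[np.argmax(scores)]


# Toy setup: two models that disagree more as |action| grows, zero reward.
def toy_policy(states, rng):
    return np.linspace(-1.0, 1.0, len(states))[:, None]


toy_ensemble = [lambda s, a: s + a, lambda s, a: s + 1.5 * a]


def toy_reward(states, actions):
    return np.zeros(len(states))


s0 = np.zeros(1)
a_rollout = up_mpc_action(s0, toy_policy, toy_ensemble, toy_reward, mode="rollout")
a_explore = up_mpc_action(s0, toy_policy, toy_ensemble, toy_reward, mode="explore")
print("conservative rollout action:", a_rollout)
print("optimistic exploration action:", a_explore)
```

In this toy setup the reward is zero everywhere, so the choice is driven by uncertainty alone: the rollout mode picks the action on which the two models agree, and the exploration mode picks an action on which they disagree most.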

Paper Structure (27 sections, 8 equations, 12 figures, 6 tables, 4 algorithms)

This paper contains 27 sections, 8 equations, 12 figures, 6 tables, 4 algorithms.

Figures (12)

  • Figure 1: Mean performance of COPlanner compared with baselines across 3 diverse benchmarks.
  • Figure 2: COPlanner framework. The core of COPlanner is the Uncertainty-aware Policy-Guided MPC (UP-MPC) phase, in which we plan trajectories of length $H$ using the learned dynamics model and the learned policy network $\pi$, and select the action with the highest trajectory reward. UP-MPC is implemented differently for its two purposes: environment exploration vs. dynamics model rollouts. In environment exploration, the trajectory reward includes an uncertainty bonus term that encourages exploring uncertain regions of the environment. In dynamics model rollouts, by contrast, the trajectory reward includes an uncertainty penalty term that keeps policy learning within confident regions of the learned dynamics model.
  • Figure 3: The Planner.
  • Figure 4: Experimental results of COPlanner-MBPO and three other baselines on proprioceptive control environments. The curves in the first eight panels come from DM Control tasks, and those in the last four from MuJoCo tasks. The results are averaged over 8 random seeds, and shaded regions correspond to the $95\%$ confidence interval across seeds. During evaluation, for each seed of each method, we run up to 1000 steps in the test environment and average over 10 evaluations. Evaluation is performed every 1000 environment steps.
  • Figure 5: Experimental results of COPlanner-DreamerV3 and three other baselines on pixel-input DMC. The results are averaged over 8 random seeds, and shaded regions correspond to the $95\%$ confidence interval across seeds. During evaluation, for each seed of each method, we run up to 1000 steps in the test environment and average over 10 evaluations. Evaluation is performed every 1000 environment steps.
  • ...and 7 more figures
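
The captions for Figures 4 and 5 report means over 8 seeds with a $95\%$ confidence interval. The captions do not say how the interval is computed, so the following is only one plausible sketch: it assumes a Student-t interval on the per-seed returns, and the function name `mean_and_ci95` and its input layout are hypothetical.

```python
import numpy as np

# Two-sided 95% Student-t critical value for 7 degrees of freedom (8 seeds).
T_CRIT_95_DF7 = 2.365


def mean_and_ci95(returns_per_seed, t_crit=T_CRIT_95_DF7):
    """Mean curve and 95% interval across seeds.

    returns_per_seed: array of shape (n_seeds, n_eval_points). The default
    t_crit is only valid for 8 seeds; pass another value otherwise.
    """
    x = np.asarray(returns_per_seed, dtype=float)
    n_seeds = x.shape[0]
    mean = x.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation.
    sem = x.std(axis=0, ddof=1) / np.sqrt(n_seeds)
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


# Toy data: 8 seeds, one evaluation point, returns 0..7.
toy_returns = np.arange(8, dtype=float)[:, None]
mean, lower, upper = mean_and_ci95(toy_returns)
print(mean, lower, upper)
```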