Bootstrap Off-policy with World Model

Guojian Zhan; Likun Wang; Xiangteng Zhang; Jiaxin Gao; Masayoshi Tomizuka; Shengbo Eben Li

TL;DR

BOOM tackles the core problem of actor divergence when combining online planning with off-policy RL by introducing bootstrap alignment with a world model. It uses a likelihood-free alignment loss and a soft value-weighted mechanism to align a parametric policy with a non-parametric planner, while the world model provides predictive trajectories and value targets to guide improvement. The authors provide theoretical bounds showing how controlling the KL divergence between planner and policy bounds return and Q-value gaps, and demonstrate state-of-the-art performance and stability on the DeepMind Control Suite and Humanoid-Bench. The approach achieves strong sample efficiency and robust final performance across high-dimensional continuous control tasks, with open-source code to facilitate reproducibility and adoption in planning-driven MBRL settings.

Abstract

Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy's actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner's non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner's action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.
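As a rough illustration of the alignment idea described above, the sketch below shows one way a value-weighted, likelihood-free alignment term could be computed. It is not the authors' implementation. It makes three assumptions: "likelihood-free" is read as needing only action samples from the planner and never the planner's density; the policy is a diagonal Gaussian; and the soft weighting is a softmax over value estimates with a temperature. All function and parameter names are hypothetical.

```python
import numpy as np


def soft_value_weights(q_values, temperature=1.0):
    """Softmax over a batch of value estimates (assumed weighting form).

    Higher-value samples receive larger weights; the weights sum to 1.
    """
    q = np.asarray(q_values, dtype=np.float64)
    z = (q - q.max()) / temperature  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def gaussian_log_prob(actions, mean, log_std):
    """Log-density of a diagonal Gaussian policy at the given actions."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * ((actions - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1)


def weighted_alignment_loss(planner_actions, policy_mean, policy_log_std,
                            q_values, temperature=1.0):
    """Value-weighted negative log-likelihood of planner actions under the policy.

    Only samples of the planner's actions are used; the planner's own
    density is never evaluated.
    """
    w = soft_value_weights(q_values, temperature)
    logp = gaussian_log_prob(planner_actions, policy_mean, policy_log_std)
    return float(-(w * logp).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    planner_actions = rng.normal(size=(4, 2))   # hypothetical replay-buffer batch
    policy_mean = np.zeros((4, 2))
    policy_log_std = np.zeros((4, 2))
    q_values = rng.normal(size=4)
    print(weighted_alignment_loss(planner_actions, policy_mean, policy_log_std, q_values))
```

Under these assumptions, minimizing the loss pulls the policy toward the planner's actions, with high-value samples contributing most. A lower temperature concentrates the weight on the best samples in the batch.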
