Strengthening Generative Robot Policies through Predictive World Modeling

Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, Heng Yang

TL;DR

The paper addresses robustness gaps in robotic control by unifying a diffusion-based generative policy with a predictive, action-conditioned world model, enabling online planning. GPC builds three components: policy learning, world-model learning from expert and exploration data, and two online planners (GPC-RANK and GPC-OPT). It optionally uses a reward predictor or zero-shot vision-language models for task guidance. Across state-based and vision-based tasks in simulation and real-world settings, GPC significantly outperforms pure behavior cloning, with gains amplified by multiple action proposals and gradient-based optimization, and it sometimes approaches the planning performance of a ground-truth simulator. The work demonstrates that random exploration data is crucial for world-model accuracy and shows that the approach can leverage foundation models to handle diverse rewards and visual dynamics, marking a practical step toward robust, flexible robotic control.

Abstract

We present generative predictive control (GPC), a learning control framework that (i) clones a generative diffusion-based policy from expert demonstrations, (ii) trains a predictive action-conditioned world model from both expert demonstrations and random explorations, and (iii) synthesizes an online planner that ranks and optimizes the action proposals from (i) by looking ahead into the future using the world model from (ii). Across a variety of robotic manipulation tasks, we demonstrate that GPC consistently outperforms behavior cloning in both state-based and vision-based settings, in simulation and in the real world.
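
The ranking step in (iii) can be illustrated with a minimal sketch. This is not the authors' implementation: the policy, world model, and reward below are hypothetical toy stand-ins (a uniform sampler, 1-D additive dynamics, and negative distance to a goal), and all function names are assumptions chosen for illustration. The sketch only shows the control flow: sample several action proposals, roll each one forward through the world model, and keep the proposal with the highest predicted reward.

```python
import random


def sample_action_proposals(state, num_proposals, horizon, rng):
    """Hypothetical stand-in for a diffusion policy: K random action sequences."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
            for _ in range(num_proposals)]


def world_model_step(state, action):
    """Hypothetical stand-in for a learned action-conditioned world model."""
    return state + action  # toy 1-D additive dynamics (assumption)


def reward(state, goal):
    """Hypothetical reward: negative distance to the goal state."""
    return -abs(state - goal)


def rollout_reward(state, actions, goal):
    """Look ahead with the world model and sum the predicted rewards."""
    total = 0.0
    for action in actions:
        state = world_model_step(state, action)
        total += reward(state, goal)
    return total


def rank_proposals(state, goal, num_proposals=16, horizon=4, seed=0):
    """Return the proposal whose predicted rollout reward is highest."""
    rng = random.Random(seed)
    proposals = sample_action_proposals(state, num_proposals, horizon, rng)
    return max(proposals, key=lambda a: rollout_reward(state, a, goal))


best = rank_proposals(state=0.0, goal=2.0)
print(best, rollout_reward(0.0, best, 2.0))
```

With a fixed seed, ranking over more proposals can only match or improve the predicted reward of the selected sequence in this toy setting, which mirrors the summary's point that gains grow with the number of action proposals.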

Paper Structure

This paper contains 20 sections, 5 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: Generative Predictive Control.
  • Figure 2: Diffusion-based Visual World Modeling. Our visual world model consists of a sequence of recursive single-step image predictors. Each single-step image predictor is designed as a conditional diffusion process.
  • Figure 3: World model prediction in GPC for simulation tasks.
  • Figure 4: Comparison of different visual world modeling approaches on two examples (highlights for comparison are boxed in blue).
  • Figure 5: Evaluation Trajectories for Triangle Drawing and Cube & Sphere Swap.
  • ...and 11 more figures
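
The recursive structure described in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: `predict_next_frame` is a hypothetical placeholder (a simple per-pixel shift) standing in for the single-step predictor, which the caption describes as a conditional diffusion process, and frames are represented as flat lists of floats purely for brevity.

```python
def predict_next_frame(frame, action):
    """Hypothetical single-step predictor (placeholder for a conditional
    diffusion model): shifts every pixel value by the action."""
    return [pixel + action for pixel in frame]


def rollout_frames(initial_frame, actions):
    """Apply the single-step predictor recursively, feeding each predicted
    frame back in as the input for the next step."""
    frames = []
    frame = initial_frame
    for action in actions:
        frame = predict_next_frame(frame, action)
        frames.append(frame)
    return frames


predicted = rollout_frames([0.0, 1.0], [1.0, 2.0])
print(predicted)
```

Because each step consumes the previous prediction, errors in a real learned predictor can compound over the horizon, which is one reason the accuracy of the single-step model matters.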

Theorems & Definitions (1)

  • Remark 1: Freeze the Noise