DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

Emily Yue-Ting Jia; Weiduo Yuan; Tianheng Shi; Vitor Guizilini; Jiageng Mao; Yue Wang

Abstract

Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.

Paper Structure (20 sections, 5 equations, 6 figures, 3 tables)

INTRODUCTION
RELATED WORK
Vision-Language(-Action) Models for Robotics
World Models for Robotics
METHOD
VLM Planner for Deformable Manipulation
Learning World Model for VLM Fine-Tuning
Reinforcement Fine-Tuning of VLM Planner with World Model
EXPERIMENTS
Real-world experiments
Hardware Setup
Automated Data Collection
Evaluation Protocol
Video World Models Serve as Reliable Verifiers
Qualitative Comparison
...and 5 more sections

Figures (6)

Figure 1: We propose DreamPlan, a highly efficient framework that adapts vision-language model (VLM) planners to real-world physics via virtual rollouts generated by world models. DreamPlan learns an action-conditioned video world model that captures task-specific deformable dynamics from exploratory interaction data collected by the zero-shot VLM, and leverages it to fine-tune the planner entirely offline. We demonstrate DreamPlan's effectiveness on three challenging deformable manipulation tasks (cloth, rope, and soft toy manipulation), where DreamPlan significantly outperforms zero-shot baselines.
Figure 2: Overview of DreamPlan. Our framework consists of three stages. (1) Zero-shot proposal: given the current observation and goal image, a pretrained VLM planner generates multiple candidate keypoint-based manipulation actions. (2) World model learning: these zero-shot actions are executed to collect diverse action–observation trajectories, which are used to fine-tune an action-conditioned diffusion world model that predicts object deformation outcomes from rendered robot-motion videos. (3) World-model-guided alignment: the trained world model acts as a verifier to evaluate sampled VLM actions by predicting their future outcomes; comparing predicted outcomes yields pairwise preferences (more vs. less goal-consistent actions), which are used to fine-tune the VLM planner via Odds Ratio Policy Optimization (ORPO), aligning it toward physics-consistent behaviors without additional real-world interaction.
Figure 3: Qualitative comparison of video generation results. Baselines fail to follow the specified actions or produce unrealistic deformations, while our method generates deformations that are both action-consistent and physically plausible, demonstrating its reliability as a verifier for VLM fine-tuning.
Figure 4: Hardware setup. Our real-world platform consists of two Franka FR3 arms positioned opposite each other to enable bimanual manipulation within a shared workspace. A top-mounted RealSense D435i camera captures the interaction area, where deformable objects are placed on the work surface for automated data collection and further manipulation.
Figure 5: Success rate improvements via RL fine-tuning. Our RL pipeline improves success rates by 15% to 40% over zero-shot baselines across all evaluated real-world tasks.
...and 1 more figures
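
The world-model-guided alignment stage described in the Figure 2 caption (score sampled actions by their predicted outcomes, then turn the scores into pairwise preferences) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `predict_outcome` stands in for the trained video world model, `goal_distance` uses mean squared error as a placeholder goal-consistency metric, and `margin` is a hypothetical tie-breaking threshold. The `odds_ratio_loss` function shows the odds-ratio term as it is commonly formulated for ORPO-style training; the paper's exact objective is not given in this section and may differ, and the commonly published objective also adds a supervised likelihood term that is omitted here.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a predicted final frame (flattened pixel values).
Frame = Sequence[float]


def goal_distance(predicted: Frame, goal: Frame) -> float:
    """Mean squared error between predicted outcome and goal (assumed metric)."""
    return sum((p - g) ** 2 for p, g in zip(predicted, goal)) / len(goal)


def build_preference_pairs(
    actions: List[float],
    predict_outcome: Callable[[float], Frame],  # assumed world-model interface
    goal: Frame,
    margin: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (chosen, rejected) pairs; chosen has the smaller predicted goal distance."""
    scored = [(goal_distance(predict_outcome(a), goal), a) for a in actions]
    pairs = []
    for (d_i, a_i), (d_j, a_j) in combinations(scored, 2):
        if abs(d_i - d_j) <= margin:
            continue  # skip ties and near-ties: no clear preference
        pairs.append((a_i, a_j) if d_i < d_j else (a_j, a_i))
    return pairs


def odds_ratio_loss(logp_chosen: float, logp_rejected: float) -> float:
    """-log sigmoid(log-odds(chosen) - log-odds(rejected)).

    Inputs are length-normalized log-probabilities and must be strictly negative.
    """
    def log_odds(logp: float) -> float:
        # log(p / (1 - p)) computed from log p
        return logp - math.log1p(-math.exp(logp))

    z = log_odds(logp_chosen) - log_odds(logp_rejected)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Toy "world model": the outcome is just the action copied into two pixels.
    toy_model = lambda a: [a, a]
    print(build_preference_pairs([0.0, 0.9, 2.5], toy_model, goal=[1.0, 1.0]))
    print(odds_ratio_loss(math.log(0.6), math.log(0.3)))
```

In this sketch the loss shrinks as the planner assigns relatively higher likelihood to the action whose predicted outcome is closer to the goal, which is the direction of alignment the caption describes. In practice the actions would be keypoint-based manipulation proposals and the outcomes generated video frames rather than scalars.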
