Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, Yilun Du

TL;DR

This work introduces Large Video Planner (LVP), a 14B video foundation model that uses internet-scale video data as the primary modality for embodied decision making. By encoding 49-frame clips into a latent space and applying latent diffusion with Diffusion Forcing and history-guided conditioning, LVP generates zero-shot video plans conditioned on a scene observation and a task instruction; the plans are then retargeted to real robot actions. The authors curate LVP-1M, a diverse 1.4M-clip dataset, and demonstrate strong task-level generalization on novel tasks selected by third parties and in real-robot experiments across multiple morphologies, outperforming VLAs and video baselines on multi-stage and dexterous tasks. The authors release the model, dataset, and training code as open source to support reproducible research on video-based robot learning and planning.

Abstract

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

Paper Structure

This paper contains 45 sections, 10 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Autonomous Robot Execution with Large Video Planner. Our approach uses video generation as a visual motion planner in pixel space. From a single image and a task instruction, the model generates a video depicting how the task should be completed. The predicted human motion is then retargeted to a robot hand for real-world execution, enabling zero-shot visual planning in diverse scenes.
  • Figure 2: LVP Overview: (a) Overview of the latent video diffusion framework. We first use a temporally causal VAE to encode video clips into compressed 3D latent representations. Then we train a diffusion transformer in this latent space with flow matching objectives. (b) We jointly train image-to-video (I2V) and video-to-video (V2V) with a modified diffusion forcing training strategy. During training, a random context length between $0$ and $6$ frames is selected, dividing the video into history and future segments. Two independent noise levels are applied to these segments, and the history segment is set to zero noise with a 50% probability. We visualize four representative cases of this noisy training strategy: the top row shows that longer contexts enable V2V training; the second row shows clean first-frame contexts, which align exactly with standard I2V training; and the bottom two rows show noisy context frames, which improve robustness to out-of-distribution conditioning.
  • Figure 3: (a) Visualization of our eight dataset sources. First row: four robotics datasets. Second row: four human-centric datasets. (b) Illustration of our video diffusion sampling strategy, where scores estimated with and without history are linearly combined. Text conditioning and the diffusion transformer are omitted for clarity.
  • Figure 4: Pipeline from Video to Action. Given a generated video depicting a human hand performing a task, we first reconstruct and track the hand in 3D (second column). The reconstructed hand motion is then retargeted to dexterous hands or grippers (third column). Finally, the retargeted trajectory is transformed into the robot’s control frame and executed in the real world (rightmost column).
  • Figure 5: Baseline Comparison. LVP accurately generates videos of hand interactions in a zero-shot setting, such as pulling out a tissue (left) and opening a gate (right). Baseline models (Wan, Cosmos-Predict 2, Hunyuan) often produce spatial or semantic inconsistencies, highlighted by red circles. The first frame and task instruction shown under each column serve as the generation conditions.
  • ...and 11 more figures
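
The training-time noise assignment in Figure 2(b) and the sampling-time score combination in Figure 3(b) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the uniform noise distribution on [0, 1], the guidance weight parameter, and the particular linear form used to mix the two scores are all assumptions made for illustration; the captions state only that the context length is drawn between 0 and 6 frames, that history and future receive independent noise levels, that the history is set to zero noise with 50% probability, and that the two scores are linearly combined.

```python
import random


def sample_noise_levels(num_frames, max_context=6, clean_history_prob=0.5, rng=random):
    """Assign one noise level per frame, following the caption of Figure 2(b).

    Hypothetical helper: noise levels are drawn uniformly from [0, 1],
    which is an assumption and not stated in the source.
    """
    if num_frames < max_context:
        raise ValueError("num_frames must be at least max_context")

    # Random context length between 0 and max_context (inclusive),
    # splitting the clip into a history segment and a future segment.
    context_len = rng.randint(0, max_context)

    # Two independent noise levels, one per segment.
    history_noise = rng.random()
    future_noise = rng.random()

    # With probability clean_history_prob the history is kept clean.
    if rng.random() < clean_history_prob:
        history_noise = 0.0

    levels = [history_noise] * context_len + [future_noise] * (num_frames - context_len)
    return context_len, levels


def combine_scores(score_with_history, score_without_history, weight):
    """Linearly combine two score estimates, as in Figure 3(b).

    Assumed form: s = s_without + weight * (s_with - s_without).
    weight = 1 recovers the history-conditioned score, and
    weight = 0 recovers the score estimated without history.
    """
    return [
        u + weight * (c - u)
        for c, u in zip(score_with_history, score_without_history)
    ]


if __name__ == "__main__":
    # The frame count here is arbitrary and chosen only for the demo.
    context_len, levels = sample_noise_levels(num_frames=12, rng=random.Random(0))
    print("context length:", context_len)
    print("per-frame noise levels:", levels)
    print("combined:", combine_scores([1.0, 2.0], [0.0, 0.0], weight=2.0))
```

In this sketch a context length of 0 yields an empty history segment, and a clean history of length 1 corresponds to the I2V case described in the caption, while longer histories correspond to V2V training.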