VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

TL;DR

VideoWorld 2 tackles learning transferable knowledge for long-horizon tasks directly from unlabeled real-world videos. It introduces a dynamics-enhanced Latent Dynamics Model (dLDM) that offloads appearance modeling to a pretrained Video Diffusion Model (VDM), which pushes the latent codes to capture task-relevant dynamics; an autoregressive model over these codes then produces long-horizon policies. On the real-world handicraft tasks of Video-CraftBench, it achieves up to 70% improvement in task success rate along with significantly better visual quality. In robotics, pretraining on Open-X substantially improves task performance on CALVIN. The work highlights the potential of learning world knowledge directly from raw videos, and the authors plan to open-source all code, data, and models to spur further research.

Abstract

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamics-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handicraft-making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
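
The two-stage flow described in the abstract (compress visual changes into latent codes, then model the codes autoregressively) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the nearest-neighbor `codebook` lookup stands in for the learned dLDM encoder, the count-based bigram table stands in for the autoregressive transformer, and the names `encode_dynamics`, `fit_bigram`, and `rollout` are hypothetical. The pretrained video diffusion decoder is omitted entirely.

```python
import numpy as np


def encode_dynamics(frames, codebook):
    """Toy stand-in for the dLDM encoder (assumption, not the paper's model).

    Each later frame's change relative to the first frame is flattened and
    mapped to the index of the nearest codebook vector.
    frames: (T, H, W) array, codebook: (K, H*W) array.
    """
    deltas = (frames[1:] - frames[0]).reshape(len(frames) - 1, -1)      # (T-1, D)
    dists = ((deltas[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T-1, K)
    return dists.argmin(axis=1)


def fit_bigram(code_sequences, num_codes):
    """Toy stand-in for the autoregressive model: smoothed next-code counts."""
    counts = np.ones((num_codes, num_codes))  # add-one smoothing
    for seq in code_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def rollout(transition, start_code, horizon):
    """Greedy autoregressive rollout: each code is predicted from the previous one."""
    codes = [int(start_code)]
    for _ in range(horizon - 1):
        codes.append(int(transition[codes[-1]].argmax()))
    return codes


# Tiny synthetic "video": 4 frames of 2x2 pixels whose changes match codebook entries.
codebook = np.array([[0.0] * 4, [1.0] * 4, [2.0] * 4])
frames = np.zeros((4, 2, 2))
frames[1], frames[2], frames[3] = 1.0, 2.0, 1.0

codes = encode_dynamics(frames, codebook)
transition = fit_bigram([codes], num_codes=3)
plan = rollout(transition, start_code=codes[0], horizon=4)
```

In the actual system a decoder would turn the predicted codes back into video; the sketch stops at the code sequence because that is the part the autoregressive stage operates on.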

Paper Structure (22 sections, 12 figures, 4 tables)

Figures (12)

  • Figure 1: (Left) VideoWorld 2 explores how to learn transferable knowledge from unlabeled real-world videos. We construct a handicraft benchmark to evaluate the learned knowledge. (Right) Comparison of different frameworks in success rate for long-horizon paper folding tasks in Video-CraftBench. We split the task into seven key steps and evaluate sequential success rates (detailed in the paper's data section). A VDM (e.g., Wan2.2 14B [wan2025wan]) produces high visual fidelity but fails to learn task-relevant dynamics or long-horizon policies. VideoWorld [ren2025videoworld] improves policy learning but suffers from poor visual quality in real-world scenarios. The bottom-right figure presents the failure cases of these baseline methods. VideoWorld 2 learns more robust latent dynamics while also achieving significantly better visual quality, enabling generalizable long-horizon knowledge learning from videos.
  • Figure 2: Qualitative Results. VideoWorld 2 learns transferable knowledge and generates long-horizon videos in unseen environments. This figure shows the output on long-horizon handicraft tasks.
  • Figure 3: Overview of the VideoWorld 2 model architecture. (Left) First, the dLDM compresses future visual changes into compact and generalizable latent codes. These codes are then modeled by an autoregressive transformer. (Right) At inference, the transformer predicts latent codes for a new, unseen environment from the input image, and these codes are subsequently decoded into task execution videos.
  • Figure 4: The proposed dynamics-enhanced latent dynamics model (dLDM). (Left) The latent dynamics model in VideoWorld [ren2025videoworld]. Visual changes between the first and subsequent frames are compressed into a set of latent codes. (Right) The dLDM proposed in VideoWorld 2. It employs a pretrained VDM as an appearance prior, yielding better latent codes and facilitating high-fidelity video output.
  • Figure 5: Video clips with similar latent dynamics features. The text below each clip indicates its dynamics type.
  • ...and 7 more figures
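
The Figure 1 caption mentions sequential success rates over seven key steps. The exact protocol is given in the paper, so the following is only one plausible reading, stated as an assumption: a trial counts as successful at step k only if every step up to and including k succeeded. The function name `sequential_success_rates` and the sample trials are hypothetical.

```python
def sequential_success_rates(trials):
    """Per-step success rate under the assumed sequential rule.

    trials: list of per-trial lists of booleans, one entry per key step,
    in execution order. A trial stops counting at its first failed step.
    """
    num_steps = len(trials[0])
    reached = [0] * num_steps
    for steps in trials:
        for k, ok in enumerate(steps):
            if not ok:
                break  # later steps cannot count once one step fails
            reached[k] += 1
    return [r / len(trials) for r in reached]


# Four made-up trials with seven key steps each (illustrative data only).
trials = [
    [True] * 7,
    [True, True, False, True, True, True, True],
    [False] * 7,
    [True] * 7,
]
rates = sequential_success_rates(trials)
```

Under this rule the rates can only stay flat or decrease from one step to the next, which is why late-step success is a strict test of long-horizon behavior.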