DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel, Ming-Yu Liu, Yuke Zhu, Joel Jang, Linxi "Jim" Fan

TL;DR

DreamDojo is a foundation world model for open-world dexterous robotics, pretrained on 44k hours of egocentric human videos with continuous latent actions serving as unified proxy labels. It combines a diffusion-based latent video backbone (Cosmos-Predict2.5) with a latent-action VAE, followed by post-training on target-robot data and Self-Forcing distillation into an autoregressive model that runs in real time at 10.81 FPS. The approach yields strong out-of-distribution generalization and robust action controllability, and it supports downstream uses such as policy evaluation, live teleoperation, and model-based planning, as backed by extensive benchmarks and human evaluations. The work points to a scalable path toward general-purpose robot world models that transfer knowledge from human videos to diverse robotic embodiments and tasks, while noting that action coverage and inference speed still leave room for improvement.

Abstract

Being able to simulate the outcomes of actions in varied environments will revolutionize the development of generalist agents at scale. However, modeling these world dynamics, especially for dexterous robotics tasks, poses significant challenges due to limited data coverage and scarce action labels. As an endeavor towards this end, we introduce DreamDojo, a foundation world model that learns diverse interactions and dexterous controls from 44k hours of egocentric human videos. Our data mixture represents the largest video dataset to date for world model pretraining, spanning a wide range of daily scenarios with diverse objects and skills. To address the scarcity of action labels, we introduce continuous latent actions as unified proxy actions, enhancing interaction knowledge transfer from unlabeled videos. After post-training on small-scale target robot data, DreamDojo demonstrates a strong understanding of physics and precise action controllability. We also devise a distillation pipeline that accelerates DreamDojo to a real-time speed of 10.81 FPS and further improves context consistency. Our work enables several important applications based on generative world models, including live teleoperation, policy evaluation, and model-based planning. Systematic evaluation on multiple challenging out-of-distribution (OOD) benchmarks verifies the significance of our method for simulating open-world, contact-rich tasks, paving the way for general-purpose robot world models.

Paper Structure

This paper contains 29 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: DreamDojo overview. DreamDojo acquires comprehensive physical knowledge from large-scale human datasets by utilizing latent actions as unified labels. After post-training and distillation on the target robots, our model can predict the future world in real time with continuous action controls. DreamDojo can robustly generalize to various objects and environments, facilitating large-scale policy evaluation without real-world deployment. It also enables live teleoperation and online model-based planning.
  • Figure 2: Distribution analysis of DreamDojo-HV. (a) Distribution of the scenarios and random examples from the most frequent categories. (b) [Left]: Distribution of subtask numbers within each video. Most videos involve long-horizon tasks that require multiple interactions to accomplish. [Right]: Representative skills in DreamDojo-HV and their frequencies. Our dataset covers a wide range of interaction types beyond pick-and-place. (c) Visualization of skill verbs and object names based on their frequency of occurrence in language annotations.
  • Figure 3: Latent action model. [Left]: The information bottleneck design of our latent action model enforces action disentanglement, producing a continuous latent vector that represents actions between frames. [Right]: We retrieve and group the frame pairs from different datasets that share the most similar latent actions. The embodiments are performing the same actions despite the significant differences in context.
  • Figure 4: Benchmark visualization. We rigorously construct six evaluation benchmarks that reflect the diverse scenarios and actions present in human datasets, while being out-of-distribution for the robot training datasets.
  • Figure 5: Downstream applications. We show evidence that DreamDojo can be readily applied to benefit robot learning, both for policy evaluation without real-world deployment and for test-time model-based planning.
  • ...and 9 more figures
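
The Figure 3 caption describes retrieving frame pairs from different datasets that share the most similar latent actions. One way to picture that retrieval is a nearest-neighbor search over latent action vectors. The snippet below is only a minimal sketch under stated assumptions: cosine similarity is assumed as the metric (the source does not say which one is used), random vectors stand in for real latent actions, and the function name `nearest_latent_actions` is hypothetical rather than part of DreamDojo.

```python
import numpy as np


def nearest_latent_actions(query, bank, k=3):
    """Return (indices, similarities) of the k bank rows closest to query.

    Assumption: closeness is measured by cosine similarity.
    query: shape (d,), one latent action vector.
    bank:  shape (n, d), latent actions of candidate frame pairs.
    """
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q                   # cosine similarity to every bank row
    top = np.argsort(-sims)[:k]    # indices sorted by descending similarity
    return top, sims[top]


# Toy data standing in for latent actions extracted from frame pairs.
rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 32))
# A query that is a slightly perturbed copy of bank row 7.
query = bank[7] + 0.01 * rng.normal(size=32)

idx, sims = nearest_latent_actions(query, bank, k=3)
```

In this toy setup the perturbed query should retrieve row 7 first; with real latent actions, the retrieved rows would index frame pairs that could then be grouped across datasets for visual comparison.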