TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang
TL;DR
TraceGen tackles the problem of learning new robot skills from a handful of demonstrations across varied embodiments by moving learning out of pixel and token spaces and into a compact 3D trace-space of scene-level trajectories. It combines TraceForge, a data engine that converts heterogeneous videos into consistent 3D traces with camera-motion compensation and speed retargeting, with a flow-based decoder that predicts 3D traces conditioned on multimodal inputs. Pretraining on 1.8M observation-trace-language triplets (from the TraceForge-123K data) enables rapid adaptation: five target-robot videos yield 80% success across four tasks, and five uncalibrated human demonstration videos captured on a handheld phone yield 67.5% success on a real robot, with inference 50-600x faster than state-of-the-art video-based world models. Because the approach does not rely on object detectors or heavy pixel-space generation, it transfers across embodiments and scenes from very little target data.
Abstract
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
