TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang

TL;DR

TraceGen addresses learning new robot skills from a few demonstrations across varied embodiments by moving learning out of pixel and token spaces and into a compact 3D trace-space of scene-level trajectories. It pairs TraceForge, a data engine that converts heterogeneous videos into consistent 3D traces using camera-motion compensation and speed retargeting, with a flow-based decoder that predicts 3D traces conditioned on language, RGB, and depth inputs. Pretraining on 1.8M observation–trace–language triplets from the TraceForge-123K corpus enables rapid adaptation: five target-robot demonstrations yield 80% success across four tasks, and five uncalibrated human demonstrations yield 67.5% success on a real robot, with inference 50-600x faster than state-of-the-art video-based world models. Because it needs neither object detectors nor heavy pixel-space generation, the approach is data-efficient and transfers across scenes and embodiments.

Abstract

Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.

Paper Structure

This paper contains 58 sections, 7 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: TraceForge provides the structured training signal, and TraceGen consumes this signal to learn a world model in 3D trace space. Pretrained on 1.8M observation–trace–language triplets from the TraceForge-123K corpus, which combines in-the-wild human videos and heterogeneous robot datasets, TraceGen acquires a strong 3D motion prior, enabling rapid adaptation to new skills and new environments. Bottom-left: Robot-domain warm-up. With only five target-robot demonstrations, TraceGen reaches 80% success across four tasks and is 50$\times$ faster than video-based world models (Veo 3.1 inference time is an average over API calls). Bottom-right: Human$\rightarrow$Robot transfer. With just five uncalibrated handheld human videos, featuring different backgrounds and object positions, TraceGen attains 67.5% real-robot success.
  • Figure 2: TraceForge-123K dataset distribution. Our corpus contains 1.8M observation–trace–language triplets, spanning tabletop, egocentric, and in-the-wild footage with moving cameras to support generalization across embodiments and scenes.
  • Figure 3: Failure cases of existing embodied world models. (a) Video-based models can hallucinate geometry or affordance. (b) VLM token outputs fail to capture fine motion. Bounding boxes miss the tool (c) or become overly broad (d).
  • Figure 4: Building the TraceForge dataset. From an input video $V_{\mathrm{in}}$: (i) chunk task-relevant spans for curation and generate task instructions; (ii) estimate camera pose and depth, select a reference image, and track 3D points to form a raw trace; (iii) apply world-to-camera alignment; (iv) apply speed retargeting to produce the final 3D trace.
  • Figure 5: Overview of TraceGen. Given language, RGB, and depth inputs, text is encoded by a frozen T5 encoder, RGB images are processed by DINOv3 and SigLIP, and depth maps are passed through a SigLIP encoder with a learnable stem adapter. The resulting visual features (RGB + depth) are concatenated and linearly projected to form unified visual tokens. Together with text tokens, these serve as conditioning inputs to a CogVideoX-based flow model, which predicts a velocity field that transforms Gaussian noise into trace patches via ODE integration. $\mathbf{X}^1$ represents the velocity-like 3D keypoint increments across frames predicted by the flow decoder, where $0, \cdots, \tau_i, \tau_{i+1}, \cdots, 1$ denote the continuous interpolation times from pure noise to the clean trace increments. The predicted patches are then unpatched into 3D keypoint trajectories, expressed in the camera coordinate frame. These trajectories can be executed using various low-level controllers; in our experiments, we apply inverse kinematics to map predicted 3D traces to robot joint commands.
  • ...and 10 more figures
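
The sampling step described in the Figure 5 caption (a velocity field carries Gaussian noise to clean trace increments via ODE integration, and the increments are then turned into 3D keypoint trajectories) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the learned, conditioned flow model, the fixed-step Euler integrator and the step count are assumptions, patching and unpatching are omitted, and accumulating increments with a cumulative sum is one plausible reading of "velocity-like 3D keypoint increments across frames".

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t takes values 0, dt, ..., 1 - dt, so (1 - t) is never zero
        x = x + dt * velocity_fn(x, t)
    return x

def increments_to_trajectories(start_points, increments):
    """Accumulate per-frame increments into keypoint trajectories.

    start_points: (K, 3) initial 3D keypoints (assumed to be in the camera frame).
    increments:   (T, K, 3) per-frame 3D displacements.
    Returns:      (T + 1, K, 3) trajectories, including the starting frame.
    """
    offsets = np.cumsum(increments, axis=0)
    return np.concatenate([start_points[None], start_points[None] + offsets], axis=0)

# Toy setup (all values are illustrative assumptions): T frames, K keypoints,
# every keypoint moving by the same small displacement per frame.
T, K = 4, 2
target_increments = np.tile(np.array([0.01, 0.0, 0.02]), (T, K, 1))

def toy_velocity(x, t):
    # Hypothetical stand-in for the learned model: the straight-line flow
    # that transports any sample x at time t to target_increments at t=1.
    return (target_increments - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal((T, K, 3))           # sample at t=0
pred_increments = euler_sample(toy_velocity, noise, num_steps=10)

start = np.zeros((K, 3))                          # keypoints at the reference frame
trajectories = increments_to_trajectories(start, pred_increments)
print(trajectories.shape)
```

In a real system the velocity function would be a neural network conditioned on the text and visual tokens, and the resulting camera-frame trajectories would be handed to a low-level controller such as inverse kinematics, as the caption notes.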