H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou
TL;DR
H2R-Grounder tackles the challenge of learning robot manipulation from unpaired human videos by introducing H2Rep, a transferable abstraction that unifies human hand poses with robot gripper trajectories in a background scene. It trains a diffusion-based video generator via in-context learning on robot-only data, where the robot is removed from each video and pose cues are overlaid; it then applies the learned model to human videos by generating corresponding robot sequences conditioned on H2Rep. The method achieves superior motion realism, background coherence, and physical plausibility compared to rendering-, animation-, and editing-based baselines, demonstrated on DexYCB, Droid, and in-the-wild videos. This paired-data-free paradigm enables scalable robot learning from abundant human footage without requiring calibrated setups or frame-aligned human-robot pairs. The results suggest a practical route to broad, data-efficient robot manipulation capabilities in diverse environments, with limitations discussed for multi-hand and cross-embodiment generalization.
Abstract
Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a state-of-the-art video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions than the baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
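To make the conditioning signal described in the abstract more concrete, the following is a minimal sketch of overlaying a pose cue (a marker plus a direction line standing in for the arrow) on an already inpainted background frame. It is an illustration under assumptions, not the authors' implementation: the function name `overlay_pose_cue`, the colors, the marker radius, and the arrow length are all hypothetical, and the inpainting step and the diffusion model are not shown.

```python
import numpy as np


def overlay_pose_cue(background, center_xy, angle_rad, radius=4, arrow_len=20,
                     marker_color=(255, 0, 0), arrow_color=(0, 255, 0)):
    """Return a copy of `background` (H, W, 3 uint8) with a pose cue drawn on it.

    Hypothetical helper: a filled disk marks the gripper (or hand) position and
    a straight line encodes its in-plane orientation. Sizes and colors are
    illustrative assumptions, not values taken from the paper.
    """
    frame = background.copy()
    h, w = frame.shape[:2]
    cx, cy = center_xy

    # Direction line: sample points from the center along the orientation.
    t = np.linspace(0.0, arrow_len, num=2 * arrow_len + 1)
    xs = np.rint(cx + t * np.cos(angle_rad)).astype(int)
    ys = np.rint(cy + t * np.sin(angle_rad)).astype(int)
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)  # clip to the image
    frame[ys[inside], xs[inside]] = arrow_color

    # Marker: filled disk centered on the position, drawn over the line.
    yy, xx = np.ogrid[:h, :w]
    disk = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    frame[disk] = marker_color
    return frame
```

Under these assumptions, the same function would be applied per frame to both robot videos (using gripper poses, for training) and human videos (using hand poses, at test time), which is what lets the two embodiments share one conditioning format.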
