H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou

TL;DR

H2R-Grounder tackles the challenge of learning robot manipulation from unpaired human videos by introducing H2Rep, a transferable abstraction that unifies human hand poses with robot gripper trajectories in a background scene. It trains a diffusion-based video generator via in-context learning on robot-only data, after removing the robot and overlaying pose cues, then applies the learned model to human videos by generating the corresponding robot sequences conditioned on H2Rep. The method achieves better motion realism, background coherence, and physical plausibility than rendering-, animation-, and editing-based baselines, as demonstrated on DexYCB, Droid, and in-the-wild videos. This paired-data-free paradigm enables scalable robot learning from abundant human footage without requiring calibrated setups or frame-aligned human-robot pairs. The results suggest a practical route to broad, data-efficient robot manipulation capabilities in diverse environments; limitations are discussed for multi-hand and cross-embodiment generalization.

Abstract

Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a SOTA video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach achieves significantly more realistic and grounded robot motions compared to baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
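
To make the representation described in the abstract more concrete, here is a minimal Python sketch of the cue-overlay step only: drawing a marker (position) and an arrow shaft (orientation) onto background frames that are assumed to be already inpainted. The function names, colors, marker radius, and arrow length are illustrative assumptions and are not taken from the paper; the inpainting, pose estimation, and diffusion fine-tuning stages are not shown.

```python
import numpy as np

def overlay_pose_cue(background, center, angle_rad, radius=3, arrow_len=12):
    """Draw a dot (position) and a line segment (orientation) on a copy of an RGB frame.

    All names and default values here are hypothetical, chosen for illustration.
    `center` is (x, y) in pixels; `angle_rad` is the in-plane direction of the cue.
    """
    frame = background.copy()  # leave the inpainted background untouched
    h, w, _ = frame.shape
    cx, cy = center

    # Marker: filled red disc around the gripper (or hand) position.
    ys, xs = np.ogrid[:h, :w]
    disc = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    frame[disc] = (255, 0, 0)

    # Arrow shaft: green pixels sampled along the orientation direction.
    steps = np.arange(arrow_len + 1)
    ax = np.clip(np.round(cx + steps * np.cos(angle_rad)).astype(int), 0, w - 1)
    ay = np.clip(np.round(cy + steps * np.sin(angle_rad)).astype(int), 0, h - 1)
    frame[ay, ax] = (0, 255, 0)
    return frame

def build_conditioning_video(backgrounds, centers, angles):
    """Apply the overlay frame by frame and stack the result into a (T, H, W, 3) array."""
    return np.stack([
        overlay_pose_cue(bg, c, a)
        for bg, c, a in zip(backgrounds, centers, angles)
    ])

# Toy usage: four black 32x32 frames with a static cue pointing along +x.
backgrounds = np.zeros((4, 32, 32, 3), dtype=np.uint8)
video = build_conditioning_video(backgrounds, [(10, 20)] * 4, [0.0] * 4)
```

Under these assumptions, the same overlay routine would serve both for robot training videos (cue from the gripper pose) and for human test videos (cue from the hand pose), which is the property that lets the conditioning signal transfer across embodiments.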

Paper Structure

This paper contains 27 sections, 11 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: H2R-Grounder converts human interaction videos into temporally aligned robotic manipulation videos, maintaining motion and background consistency and ensuring physically plausible robot arm structures and interactions. RoboMaster (animation-based) loses motion and background consistency. Kling and Runway Aleph (editing-based) produce geometrically distorted robot arms.
  • Figure 2: Issues in prior rendering-based H2R methods. (a) shows the rendered robot arm from Phantom, produced using their released code and provided calibrated camera parameters. Without accurate depth, the gripper appears to “float” above the book. (b) shows an overlaid robotic arm from H2R, collected from their public dataset, which suffers from severe floating artifacts and camera misalignment.
  • Figure 3: Paradigm of H2R-Grounder. The overall pipeline consists of three stages: (1) training data collection from robot video datasets, (2) in-context fine-tuning of the video generation model, and (3) transfer from in-the-wild human videos to robot manipulation videos.
  • Figure 4: Comparison of video inpainting methods on the robot arm removal task, evaluated on a sample from the Droid dataset.
  • Figure 5: OOD H2R transfer. Top row: results on internet videos. Bottom row: results on DexYCB videos.
  • ...and 1 more figures