Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction
Hongyi Chen, Tony Dong, Tiancheng Wu, Liquan Wang, Yash Jangir, Yaru Niu, Yufei Ye, Homanga Bharadhwaj, Zackory Erickson, Jeffrey Ichnowski
TL;DR
VideoManip addresses the challenge of learning dexterous manipulation from RGB human videos without robot demonstrations by reconstructing explicit 4D hand–object trajectories and retargeting them to robot hands. It introduces two core components, differentiable hand–object contact optimization and DemoGen trajectory synthesis, to produce diverse, physically plausible demonstrations from a single video, enabling generalizable policies. Empirical results show a 70.25% success rate across 20 objects in simulation with the Inspire Hand and an average 62.86% success rate across seven real-world manipulation tasks with the LEAP Hand, outperforming retargeting-based baselines by 15.87% on the real-world tasks. The work demonstrates a scalable, device-free approach to dexterous manipulation learning from ubiquitous RGB videos, with potential for broad applicability and data augmentation in robotics.
Abstract
Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses and object meshes, then retargeting the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
