Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction

Hongyi Chen, Tony Dong, Tiancheng Wu, Liquan Wang, Yash Jangir, Yaru Niu, Yufei Ye, Homanga Bharadhwaj, Zackory Erickson, Jeffrey Ichnowski

TL;DR

VideoManip learns dexterous manipulation from RGB human videos, without robot demonstrations, by reconstructing explicit 4D hand-object trajectories and retargeting them to robot hands. It introduces two core components, differentiable hand-object contact optimization and DemoGen trajectory synthesis, which together produce diverse, physically plausible demonstrations from a single video and enable generalizable policies. In simulation, the grasping model reaches a 70.25% success rate across 20 objects with the Inspire Hand; in the real world, policies reach a 62.86% average success rate across seven manipulation tasks with the LEAP Hand, outperforming retargeting-based baselines by 15.87%. The work demonstrates a scalable, device-free approach to learning dexterous manipulation from ubiquitous RGB videos, with potential for broad applicability and data augmentation in robotics.

Abstract

Multi-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VideoManip, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VideoManip reconstructs explicit 4D robot-object trajectories from monocular videos by estimating human hand poses and object meshes and then retargeting the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at videomanip.github.io.
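
The demonstration synthesis idea in the abstract, turning one reconstructed trajectory into many, can be illustrated with a small sketch. This is not the paper's DemoGen procedure, which the abstract does not detail; it only assumes that a reconstructed demonstration is stored as world-frame 4x4 poses for the hand and the object, and that a new demonstration is obtained by rigidly re-anchoring the whole trajectory to a perturbed initial object pose. The function names (`make_pose`, `synthesize_demo`, `sample_demos`) and the perturbation ranges are hypothetical.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 homogeneous pose from a yaw angle (rad) and a 3D translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def synthesize_demo(hand_poses, obj_poses, new_obj_start):
    """Re-anchor one demonstration to a new initial object pose.

    hand_poses, obj_poses: arrays of shape (T, 4, 4), world-frame poses.
    new_obj_start: (4, 4) desired object pose at the first time step.
    The same rigid transform is applied to every frame, so the relative
    hand-object pose at each time step is unchanged.
    """
    delta = new_obj_start @ np.linalg.inv(obj_poses[0])
    return delta @ hand_poses, delta @ obj_poses


def sample_demos(hand_poses, obj_poses, n, rng, xy_range=0.1, yaw_range=np.pi / 6):
    """Generate n demonstrations with random planar shifts and yaw changes.

    The ranges are illustrative assumptions, not values from the paper.
    """
    demos = []
    for _ in range(n):
        dx, dy = rng.uniform(-xy_range, xy_range, size=2)
        dyaw = rng.uniform(-yaw_range, yaw_range)
        # Rotate the object about its own origin, then shift it in the plane.
        new_start = obj_poses[0].copy()
        new_start[:3, :3] = make_pose(dyaw, [0.0, 0.0, 0.0])[:3, :3] @ obj_poses[0][:3, :3]
        new_start[:3, 3] = obj_poses[0][:3, 3] + np.array([dx, dy, 0.0])
        demos.append(synthesize_demo(hand_poses, obj_poses, new_start))
    return demos
```

Because every frame receives the same rigid transform, contacts established in the source trajectory are preserved in each synthesized one. A fuller pipeline would also need to handle the approach motion and workspace limits, which this sketch ignores.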

Paper Structure

This paper contains 11 sections, 2 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of the VideoManip framework. We first reconstruct 4D robot-object interaction trajectories from RGB human videos via recent advances in 3D vision (Sec. \ref{sec:method_recon}). To utilize the reconstructed data for dexterous grasping and manipulation learning, we perform grasp contact optimization and interaction-centric grasp modeling, and synthesize trajectories for generalizable manipulation (Sec. \ref{sec:method_train}). Finally, we deploy the trained models for real-world dexterous grasping and manipulation.
  • Figure 2: Quantitative Results on Grasping and Manipulation. Grasping: (a) Success rates across object groups, with comparison between models trained with and without grasp optimization; (b) Ablation study on incorporating additional videos for previously failed objects. Manipulation: (c) Performance comparison between our method and baselines across seven manipulation tasks; (d) Ablation study on the number of DemoGen-synthesized trajectories.
  • Figure 3: Predicted grasps and success rates in IsaacGym. The DRO grasping model is trained on 20 object categories. Each object is evaluated over 100 trials and sorted by descending success rate; red dotted box denotes failed grasps.
  • Figure 4: (a) Visualization of the 20 objects used for video collection and grasp model training. (b) Grasp optimization results on Bottle, Spray Bottle, and Apple. Left: unoptimized reconstructed grasps; right: grasps optimized using ContactOpt \cite{grady2021contactopt}.
  • Figure 5: Visualization of VideoManip Execution Using In-Scene (top) and In-the-Wild (bottom) Video Data Sources. For each task, given RGB human videos (row 1), we reconstruct 4D trajectories of the human hand and objects (row 2). Policies trained on these trajectories are then executed on a real-world LEAP Hand (row 3).