DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos

Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, Tsung-Wei Ke

TL;DR

DexMan tackles the challenge of acquiring bimanual dexterous manipulation skills from human videos without calibration or motion capture. It presents a four-stage pipeline (3D object reconstruction, hand/object pose estimation, motion retargeting to a humanoid, and a residual RL policy guided by a contact-centric reward) that learns policies from noisy monocular videos. Key contributions include a depth-informed object pose pipeline, a stable object-pose sampling strategy, a learned finger IK, and a contact-prior attraction reward that guides RL toward meaningful grasps. Empirically, DexMan achieves state-of-the-art results in object pose estimation on TACO and in RL success rate on OakInk-v2, and it demonstrates end-to-end video-to-robot skill transfer from real and synthetic monocular videos. This enables scalable datasets for generalist dexterous manipulation, while sim-to-real gaps remain.

Abstract

We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD, respectively. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation.

Paper Structure

This paper contains 36 sections, 14 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: DexMan is an automated framework that transfers human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Going beyond motion-capture data, DexMan can generate skills from either in-the-wild or synthetic videos, eliminating the need for manual data collection, and thereby enabling the curation of large-scale robotic datasets.
  • Figure 2: Overview of DexMan. DexMan is a framework for acquiring robot skills from human videos. Top: From monocular input, DexMan reconstructs object meshes, estimates depth, and recovers 3D hand–object motions, then retargets these to a full humanoid robot in simulation (Makoviychuk et al., 2021) rather than to floating hands. Bottom: A residual RL policy refines retargeted motions to reproduce object trajectories, guided by human motion and contact priors. DexMan introduces a contact reward that encourages stable grasps for effective RL training, enabling the robot to complete demonstrated manipulation tasks.
  • Figure 3: Sampling stable object configuration. DexMan perturbs object poses with random axes and angles, simulates each configuration, and selects the stable one closest to the original pose for placement in simulation.
  • Figure 4: Contact reward. The attraction term pulls robot hand keypoints $\mathbf{p}_{j,t}^{\mathrm{rob}}$ toward human-contacted object vertices $\mathbf{v}_{j,t}^o$ and aligns the keypoint–vertex vector with the surface normal $\mathbf{n}_{j,t}^{\mathrm{rob}}$, ensuring within-grasp contacts.
  • Figure 5: Visual comparison of object pose estimation. We show the estimated object pose with an oriented 3D bounding box, along with three coordinate axes. Our method incorporates additional motion cues (3D point trajectories), producing more stable and accurate pose estimates than FoundationPose (Wen et al., 2024).
  • ...and 17 more figures
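
The sampling procedure described in the Figure 3 caption (perturb the object pose with random axes and angles, simulate each candidate, keep the stable one closest to the original) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample count, the maximum perturbation angle, and the choice of geodesic rotation angle as the "closeness" measure are all hypothetical. The physics rollout is replaced by a caller-supplied `is_stable` callback, and the unperturbed pose is included as a candidate, which is also an assumption.

```python
import numpy as np


def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation matrix for a rotation of `angle` about `axis`."""
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def rotation_distance(R_a, R_b):
    """Geodesic angle (radians) between two rotation matrices."""
    cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def sample_stable_rotation(R_init, is_stable, num_samples=64,
                           max_angle=np.pi / 6, seed=0):
    """Return the stable candidate rotation closest to R_init, and its distance.

    `is_stable` is a hypothetical stand-in for simulating a candidate and
    checking that the object comes to rest; it maps a 3x3 rotation to a bool.
    Returns (None, inf) if no candidate passes the check.
    """
    rng = np.random.default_rng(seed)
    candidates = [R_init]  # assumption: the unperturbed pose is also tested
    for _ in range(num_samples):
        axis = rng.normal(size=3)            # random rotation axis
        angle = rng.uniform(0.0, max_angle)  # random rotation angle
        candidates.append(axis_angle_to_matrix(axis, angle) @ R_init)

    best_R, best_dist = None, np.inf
    for R in candidates:
        if not is_stable(R):
            continue
        dist = rotation_distance(R_init, R)
        if dist < best_dist:
            best_R, best_dist = R, dist
    return best_R, best_dist


if __name__ == "__main__":
    R0 = axis_angle_to_matrix(np.array([1.0, 0.0, 0.0]), 0.3)
    # Toy stability rule (hypothetical): reject poses within 0.1 rad of R0.
    R_best, d_best = sample_stable_rotation(
        R0, lambda R: rotation_distance(R0, R) > 0.1)
    print("distance to original pose (rad):", d_best)
```

In a real pipeline the `is_stable` callback would presumably step a physics simulator and compare the object pose before and after settling; the sketch covers orientation only, and handling translation would need an additional distance term.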