DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos
Jhen Hsieh, Kuan-Hsun Tu, Kuo-Han Hung, Tsung-Wei Ke
TL;DR
DexMan tackles the challenge of acquiring bimanual dexterous manipulation skills from human videos without calibration or motion capture. It presents a four-stage pipeline (3D object reconstruction, hand/object pose estimation, motion retargeting to a humanoid, and a residual RL policy guided by a contact-centric reward) to learn policies from noisy monocular videos. Key contributions include a depth-informed object pose pipeline, a stable object-pose sampling strategy, learned finger inverse kinematics (IK), and a contact-prior attraction reward that robustly guides RL toward meaningful grasps. Empirically, DexMan achieves state-of-the-art performance on the TACO pose estimation and OakInk-v2 RL benchmarks, and demonstrates end-to-end video-to-robot skill transfer from real and synthetic monocular videos, enabling scalable datasets for generalist dexterous manipulation while highlighting remaining sim-to-real gaps.
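To make the term "residual RL policy" concrete, the sketch below shows the generic residual-control idea: a learned correction is added to a reference action obtained from motion retargeting. It is not taken from DexMan. The function name `residual_action`, the `scale` factor, and the joint-limit clamping are all assumptions made for illustration.

```python
def residual_action(reference, residual, lower, upper, scale=0.1):
    """Combine retargeted joint targets with a learned correction.

    reference: joint targets from motion retargeting (hypothetical input)
    residual:  raw policy output, one value per joint (hypothetical input)
    lower/upper: per-joint limits used to clamp the result (assumed)
    scale: keeps the correction small relative to the reference (assumed)
    """
    return [
        min(max(q_ref + scale * dq, lo), hi)
        for q_ref, dq, lo, hi in zip(reference, residual, lower, upper)
    ]


# The first joint is nudged by 0.1; the second would exceed its upper
# limit, so it is clamped to that limit.
action = residual_action([0.5, 1.0], [1.0, 5.0], [0.0, 0.0], [1.0, 1.2])
```

Under this formulation, the policy only has to learn small corrections to the video-derived reference rather than the whole motion, which is why residual formulations are commonly paired with noisy reference trajectories.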
Abstract
We present DexMan, an automated framework that converts human visual demonstrations into bimanual dexterous manipulation skills for humanoid robots in simulation. Operating directly on third-person videos of humans manipulating rigid objects, DexMan eliminates the need for camera calibration, depth sensors, scanned 3D object assets, or ground-truth hand and object motion annotations. Unlike prior approaches that consider only simplified floating hands, it directly controls a humanoid robot and leverages novel contact-based rewards to improve policy learning from noisy hand-object poses estimated from in-the-wild videos. DexMan achieves state-of-the-art performance in object pose estimation on the TACO benchmark, with absolute gains of 0.08 and 0.12 in ADD-S and VSD. Meanwhile, its reinforcement learning policy surpasses previous methods by 19% in success rate on OakInk-v2. Furthermore, DexMan can generate skills from both real and synthetic videos without manual data collection or costly motion capture, enabling the creation of large-scale, diverse datasets for training generalist dexterous manipulation policies.
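For readers unfamiliar with ADD-S, the sketch below computes the underlying distance as it is commonly defined in the object pose estimation literature: the mean distance from each model point under the ground-truth pose to its nearest model point under the estimated pose. Benchmarks usually report a score derived from this distance (for example a recall at a threshold, or an area under the curve), and the abstract does not say which variant is used here. The helper names and the toy square model are illustrative assumptions.

```python
import math


def transform(points, rotation, translation):
    """Apply x -> R x + t to each 3D point (rotation is a 3x3 nested list)."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_s(model_points, gt_pose, est_pose):
    """Mean closest-point distance between the two posed copies of the model.

    Each pose is a (rotation, translation) pair. Matching every point to its
    nearest neighbour makes the measure tolerant of object symmetries.
    """
    gt = transform(model_points, *gt_pose)
    est = transform(model_points, *est_pose)
    return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)


# Toy model: four corners of a square, symmetric under 90-degree turns about z.
square = [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
quarter_turn_z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]

# A quarter turn maps the square onto itself, so the distance is zero.
symmetric_error = add_s(square, (identity, (0, 0, 0)), (quarter_turn_z, (0, 0, 0)))

# A pure translation of 0.1 along x gives a distance of about 0.1.
shifted_error = add_s(square, (identity, (0, 0, 0)), (identity, (0.1, 0, 0)))
```

The brute-force nearest-neighbour search is quadratic in the number of points. That is acceptable for a small illustration, but a spatial index such as a k-d tree would be the usual choice for dense object models.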
