DexImit: Learning Bimanual Dexterous Manipulation from Monocular Human Videos

Juncheng Mu; Sizhe Yang; Yiming Bao; Hojin Bae; Tianming Wei; Linning Xu; Boyi Li; Huazhe Xu; Jiangmiao Pang

TL;DR

DexImit proposes a scalable, depth-free, monocular-video-based pipeline to generate high-quality, physically plausible bimanual dexterous manipulation data from human demonstrations. By reconstructing 4D hand-object interactions, decomposing tasks with an action-centric scheduler, and generating robot trajectories using force-closure grasps and motion planning, it enables zero-shot policy transfer after comprehensive data augmentation. The approach demonstrates strong data usability, quality, and transfer capability across tool use, long-horizon tasks, and fine-grained manipulation, with successful real-world deployment on dual-arm robots. This framework addresses longstanding data scarcity in dexterous manipulation and offers a practical path to scalable, real-world robot learning from abundant human videos.

Abstract

Data scarcity fundamentally limits the generalization of bimanual dexterous manipulation, as real-world data collection for dexterous hands is expensive and labor-intensive. Human manipulation videos, as a direct carrier of manipulation knowledge, offer significant potential for scaling up robot learning. However, the substantial embodiment gap between human hands and robotic dexterous hands makes direct pretraining from human videos extremely challenging. To bridge this gap and unleash the potential of large-scale human manipulation video data, we propose DexImit, an automated framework that converts monocular human manipulation videos into physically plausible robot data, without any additional information. DexImit employs a four-stage generation pipeline: (1) reconstructing hand-object interactions from arbitrary viewpoints with near-metric scale; (2) performing subtask decomposition and bimanual scheduling; (3) synthesizing robot trajectories consistent with the demonstrated interactions; (4) comprehensive data augmentation for zero-shot real-world deployment. Building on these designs, DexImit can generate large-scale robot data based on human videos, either from the Internet or video generation models. DexImit is capable of handling diverse manipulation tasks, including tool use (e.g., cutting an apple), long-horizon tasks (e.g., making a beverage), and fine-grained manipulations (e.g., stacking cups).
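The four-stage flow described above can be pictured as a chain of functions that each consume and return a shared state. The sketch below is only a structural illustration under stated assumptions: `PipelineState`, the stage function names, and `run_pipeline` are hypothetical and not taken from the paper, and each stage body is a placeholder that merely records that it ran rather than performing reconstruction, scheduling, planning, or augmentation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineState:
    """Hypothetical container passed between stages (not from the paper)."""
    video_path: str
    completed: List[str] = field(default_factory=list)


def reconstruct(state: PipelineState) -> PipelineState:
    # Stage 1 placeholder: reconstruct hand-object interactions from the video.
    state.completed.append("reconstruction")
    return state


def schedule(state: PipelineState) -> PipelineState:
    # Stage 2 placeholder: subtask decomposition and bimanual scheduling.
    state.completed.append("scheduling")
    return state


def synthesize(state: PipelineState) -> PipelineState:
    # Stage 3 placeholder: synthesize robot trajectories.
    state.completed.append("action")
    return state


def augment(state: PipelineState) -> PipelineState:
    # Stage 4 placeholder: data augmentation of the generated source data.
    state.completed.append("augmentation")
    return state


# Stage order follows the abstract: reconstruction, scheduling, action, augmentation.
STAGES: List[Callable[[PipelineState], PipelineState]] = [
    reconstruct,
    schedule,
    synthesize,
    augment,
]


def run_pipeline(video_path: str) -> PipelineState:
    """Run every stage in order on a single monocular video (illustrative only)."""
    state = PipelineState(video_path)
    for stage in STAGES:
        state = stage(state)
    return state


if __name__ == "__main__":
    print(run_pipeline("demo.mp4").completed)
```

Keeping each stage behind the same signature means one stage could be swapped or tested in isolation without touching the others; whether the authors structure their implementation this way is not stated in the source.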

Paper Structure (29 sections, 19 equations, 11 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Learning from Videos
Monocular Reconstruction
Robot Data Generation from Reconstructed Reference
Method
Reconstruction of 4D Hand-Object Interactions
Video Process
Segmentation
Objects and Hands Reconstruction
6D Pose Estimation
World Coordinate Transformation
Subtask Decomposition and Task Scheduling
Source Data Generation
Grasp Synthesis
...and 14 more sections

Figures (11)

Figure 1: We introduce DexImit, a framework for learning dexterous manipulation directly from videos. DexImit leverages generated or in-the-wild videos to synthesize physically plausible demonstrations, including challenging tool-using, long-horizon, and fine-grained tasks. The gallery highlights the breadth of manipulation tasks generated by DexImit.
Figure 2: We adopt a four-stage paradigm: Reconstruction-Scheduling-Action-Augmentation. (1) Reconstruct 4D hand-object interactions and transform them to a unified world frame. (2) Decompose the manipulation process into subtasks and schedule bimanual actions for long-horizon tasks using an Action-Centric Scheduling Algorithm. (3) Generate robot trajectories via grasp synthesis and motion planning. (4) Augment the resulting source data comprehensively to enable robust policy learning.
Figure 3: Usability evaluation of generated dexterous manipulation data. The analysis considers two orthogonal factors: input data quality and target task difficulty. We report data usability rates for two representative manipulation tasks at each difficulty level, with usability visualized on a gray-to-green color scale.
Figure 4: DexImit can generate physically plausible data for long-horizon and fine-grained real-world tasks.
Figure 5: Real world experiment setup.
...and 6 more figures
