DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation

Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, Shuran Song

TL;DR

DexUMI tackles the embodiment gap in transferring dexterous manipulation skills from humans to diverse robot hands by coupling a wearable hand exoskeleton with a software pipeline that replaces human hands in demonstrations with robot hands. The hardware adaptation aligns human fingertip motion to the target robot hand while preserving wearability and capturing accurate joint angles and tactile data; the software adaptation renders robot-hand visuals to produce training data with consistent observations. Real-world experiments on Inspire and XHand across four tasks show 86% average success and a 3.2× increase in data-collection efficiency over teleoperation, with analyses highlighting the benefits of relative finger actions and tactile input under different conditions. Overall, DexUMI presents a scalable approach for efficient, cross-hardware dexterous policy learning using human-in-the-loop demonstrations aligned with robot capabilities.

Abstract

We present DexUMI - a data collection and policy learning framework that uses the human hand as the natural interface to transfer dexterous manipulation skills to various robot hands. DexUMI includes hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap using a wearable hand exoskeleton. It allows direct haptic feedback in manipulation data collection and adapts human motion to feasible robot hand motion. The software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI's capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.

Paper Structure

This paper contains 27 sections, 4 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: DexUMI transfers dexterous human manipulation skills to various robot hands using wearable exoskeletons and a data processing framework. We demonstrate DexUMI's capability and effectiveness on both underactuated (e.g., Inspire) and fully-actuated (e.g., XHand) robot hands across a wide variety of manipulation tasks. Please see the project website https://dex-umi.github.io/ for details.
  • Figure 2: Exoskeleton Design. The optimized exoskeleton design shares the same joint-to-fingertip position mapping as the target robot hand while maintaining wearability. The exoskeleton uses encoders to precisely capture joint actions and a 150° DFoV camera to record information-rich visual observations. An iPhone is rigidly mounted to track the wrist pose through ARKit.
  • Figure 3: Mechanism Optimization. To avoid thumb collision between the human hand and the exoskeleton, the hardware optimization step allows us to move the exoskeleton thumb backward while still preserving the original fingertip and joint mapping in SE(3) space.
  • Figure 4: Bridging the Visual Gap. To convert the visual observation into policy training data, we first segment the exoskeleton using SAM2 (b) and inpaint the missing background (c). The corresponding joint action (a) is replayed on the dexterous hand to obtain the robot hand image (d). SAM2 is applied to obtain the robot mask (e). The intersection (f) of the exoskeleton mask (b) and robot mask (e) identifies the visible part of the hand during interaction. Finally, we replace pixels in the inpainted background (c) with the visible robot hand (g).
  • Figure 5: Policy Rollout. We evaluate DexUMI's capabilities across challenging real-world tasks. The Cube task tests basic picking precision. The Egg Carton task evaluates multi-finger coordination. The Tea Picking task assesses performance on contact-rich manipulation requiring millimeter-level fine-grained fingertip actions. Finally, the Kitchen task tests long-horizon, high-precision actions: manipulating a knob, moving a pan using the sides of the thumb and index finger (beyond just the fingertips), and using tactile sensing for the visually challenging salt-picking step.
  • ...and 7 more figures
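
The compositing step described in the Figure 4 caption can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the exoskeleton mask (b), the robot mask (e), the inpainted background (c), and the robot-hand image (d) are already available as same-sized 2D grids, and it uses plain nested lists with a hypothetical function name, `composite_robot_hand`, in place of real images and SAM2 outputs.

```python
from typing import List

Mask = List[List[bool]]
Image = List[List[int]]  # single-channel toy image; real frames would be RGB


def composite_robot_hand(
    inpainted_bg: Image,  # (c) background with the exoskeleton inpainted away
    robot_img: Image,     # (d) robot hand replaying the recorded joint action
    exo_mask: Mask,       # (b) exoskeleton segmentation
    robot_mask: Mask,     # (e) robot-hand segmentation
) -> Image:
    """Paste robot pixels only where both masks agree (f), giving (g)."""
    out = [row[:] for row in inpainted_bg]  # copy so the input is untouched
    for y, (exo_row, rob_row) in enumerate(zip(exo_mask, robot_mask)):
        for x, (in_exo, in_robot) in enumerate(zip(exo_row, rob_row)):
            # Intersection of the two masks: the part of the hand treated as
            # visible during interaction, per the caption.
            if in_exo and in_robot:
                out[y][x] = robot_img[y][x]
    return out


if __name__ == "__main__":
    bg = [[0, 0, 0], [0, 0, 0]]
    robot = [[9, 9, 9], [9, 9, 9]]
    exo = [[True, True, False], [False, True, False]]
    rob = [[False, True, True], [False, True, True]]
    print(composite_robot_hand(bg, robot, exo, rob))
```

Pixels covered by only one of the two masks keep the inpainted background. Taking the intersection means robot pixels appear only where the exoskeleton was itself visible in the original frame, so regions of the exoskeleton that were hidden behind an object stay hidden in the composite.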