Learning Dexterous In-Hand Manipulation with Multifingered Hands via Visuomotor Diffusion

Piotr Koczy, Michael C. Welle, Danica Kragic

TL;DR

The paper tackles dexterous in-hand manipulation with multifingered hands by extending visuomotor diffusion policies to autonomous, one-hand unscrewing tasks. It introduces an AR-based teleoperation pipeline to collect high-quality demonstrations and a demonstration-filtering method using HDBSCAN and GLOSH to improve data reliability. Through comprehensive ablations, it shows that wrist-camera observations combined with joint positions and effort provide the strongest policy performance, achieving an 85% real-world success rate on unscrewing a bottle lid. The work demonstrates the feasibility of deploying visuomotor diffusion policies on mobile platforms and underscores the value of targeted demonstration filtering for robust dexterous control.

Abstract

We present a framework for learning dexterous in-hand manipulation with multifingered hands using visuomotor diffusion policies. Our system enables complex in-hand manipulation tasks, such as unscrewing a bottle lid with one hand, by leveraging a fast and responsive teleoperation setup for the four-fingered Allegro Hand. We collect high-quality expert demonstrations using an augmented reality (AR) interface that tracks hand movements and applies inverse kinematics and motion retargeting for precise control. The AR headset provides real-time visualization, while gesture controls streamline teleoperation. To enhance policy learning, we introduce a novel demonstration outlier removal approach based on HDBSCAN clustering and the Global-Local Outlier Score from Hierarchies (GLOSH) algorithm, effectively filtering out low-quality demonstrations that could degrade performance. We evaluate our approach extensively in real-world settings and provide all experimental videos on the project website: https://dex-manip.github.io/
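The filtering step described above can be illustrated with a minimal sketch. It assumes each demonstration has already been assigned a scalar outlier score (for example by GLOSH on top of an HDBSCAN clustering), that higher scores mean more anomalous demonstrations, and that demonstrations are kept when their score falls at or below a chosen percentile. The function names `percentile` and `keep_indices`, the example scores, and the cutoff values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Linearly interpolated percentile of `values` for pct in [0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= pct <= 100.0:
        raise ValueError("pct must be in [0, 100]")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def keep_indices(outlier_scores: Sequence[float], pct: float) -> List[int]:
    """Indices of demonstrations scoring at or below the pct-th percentile."""
    cutoff = percentile(outlier_scores, pct)
    return [i for i, s in enumerate(outlier_scores) if s <= cutoff]


# Hypothetical per-demonstration outlier scores (assumed: higher = more anomalous).
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.80, 0.95]
kept_90 = keep_indices(scores, 90.0)  # drops only the most anomalous demonstration
kept_50 = keep_indices(scores, 50.0)  # keeps the lower-scoring half
```

A stricter percentile trades dataset size for cleaner demonstrations, so the cutoff is best treated as a tunable choice rather than a fixed value.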

Paper Structure

This paper contains 12 sections, 2 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Top: our Allegro AR teleoperation system. The operator wears the AR headset and sees both the hand tracking and the Allegro Hand in view, enabling intuitive and responsive operation. Bottom: the trained visuomotor diffusion policy autonomously unscrewing the bottle.
  • Figure 2: Overview of our system: For teleoperation (blue boxes), we obtain the operator's hand position via the Meta Quest 3 hand tracking and send the vertex positions via a Unity-ROS TCP connection. A hand retargeting node then performs inverse kinematics and motion retargeting to obtain the relative target joint positions of the Allegro Hand $\Delta q$. We save the joint position $q$, the joint effort $\tau$, and the top and wrist camera images $I_t, I_w$. During autonomous operation, the trained visuomotor diffusion policy takes the Allegro Hand’s current joint position $q$, effort $\tau$, and camera images $I_t, I_w$ as input and outputs the next joint position change $\Delta q$ to execute the manipulation task.
  • Figure 3: Retargeting steps: (a) Initial alignment of human hand vertices (green spheres) to the Allegro Hand. (b) Scaling of finger joint lengths. (c) Final IK targets (red spheres) with additional adjustments to enhance control.
  • Figure 4: Left: Examples of randomized placement prompts to ensure positional diversity. Right: Histogram of demonstration durations.
  • Figure 5: Distribution of outlier scores, with vertical lines marking the $90$th, $70$th, and $50$th percentiles.
  • ...and 2 more figures
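
The policy interface described in the Figure 2 caption can be sketched as follows. This is an illustration only: the `Observation` container, the `apply_delta` helper, the joint count of 16 (four fingers with four joints each), and the per-step clamp `max_step` are assumptions made for the example, not details confirmed by the paper. The sketch shows how a predicted joint position change $\Delta q$ could be turned into the next joint target $q + \Delta q$.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence

NUM_JOINTS = 16  # assumption: 4 fingers x 4 joints on the Allegro Hand


@dataclass
class Observation:
    """Inputs named in the Figure 2 caption (container itself is hypothetical)."""
    q: List[float]      # joint positions
    tau: List[float]    # joint efforts
    image_top: Any      # I_t, top camera image (placeholder type)
    image_wrist: Any    # I_w, wrist camera image (placeholder type)


def apply_delta(q: Sequence[float], delta_q: Sequence[float],
                max_step: float = 0.1) -> List[float]:
    """Return q + delta_q, clamping each change to [-max_step, max_step].

    The clamp is a safety assumption added for this sketch, not part of
    the described system.
    """
    if len(q) != NUM_JOINTS or len(delta_q) != NUM_JOINTS:
        raise ValueError(f"expected {NUM_JOINTS} joint values")
    return [qi + max(-max_step, min(max_step, dqi))
            for qi, dqi in zip(q, delta_q)]


# Hypothetical example: the second half of the joints requests an oversized step.
obs = Observation(q=[0.0] * NUM_JOINTS, tau=[0.0] * NUM_JOINTS,
                  image_top=None, image_wrist=None)
predicted_delta = [0.05] * 8 + [0.5] * 8
q_target = apply_delta(obs.q, predicted_delta)
```

Predicting relative changes rather than absolute joint targets matches the caption's description of the policy output; how the targets are then sent to the hand controller is outside this sketch.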