DexterityGen: Foundation Controller for Unprecedented Dexterity

Zhao-Heng Yin, Changhao Wang, Luis Pineda, Francois Hogan, Krishna Bodduluri, Akash Sharma, Patrick Lancaster, Ishita Prasad, Mrinal Kalakrishnan, Jitendra Malik, Mike Lambeta, Tingfan Wu, Pieter Abbeel, Mustafa Mukadam

TL;DR

DexterityGen tackles dexterous in-hand manipulation by using reinforcement learning to generate a large-scale simulation dataset of low-level motion primitives and distilling that dataset into a diffusion-based generator. The resulting DexGen controller translates coarse, externally provided motion prompts into safe, fine-grained finger motions, and an inverse dynamics module converts those motions into executable actions. Gradient guidance during diffusion sampling keeps the output close to the input command while keeping it within the learned distribution of stable motions. In real-world tests with teleoperation prompts, DexGen performs object reorientation and tool use (e.g., pen, syringe, screwdriver) and supports shared autonomy in which the controller stabilizes contact on the operator's behalf. The work suggests a practical, scalable path toward a foundation controller for dexterous robotics that couples high-level semantic guidance with reliable low-level execution across diverse objects and tasks, and it identifies touch sensing and vision integration as directions for future work.

Abstract

Teaching robots dexterous manipulation skills, such as tool use, presents a significant challenge. Current approaches can be broadly categorized into two strategies: human teleoperation (for imitation learning) and sim-to-real reinforcement learning. The first approach is difficult because it is hard for humans to produce safe and dexterous motions on a different embodiment without touch feedback. The second, RL-based approach struggles with the domain gap and involves highly task-specific reward engineering on complex tasks. Our key insight is that RL is effective at learning low-level motion primitives, while humans excel at providing coarse motion commands for complex, long-horizon tasks. Therefore, the optimal solution might be a combination of both approaches. In this paper, we introduce DexterityGen (DexGen), which uses RL to pretrain large-scale dexterous motion primitives, such as in-hand rotation or translation. We then leverage this learned dataset to train a dexterous foundational controller. In the real world, we use human teleoperation as a prompt to the controller to produce highly dexterous behavior. We evaluate the effectiveness of DexGen in both simulation and the real world, demonstrating that it is a general-purpose controller that can realize input dexterous manipulation commands and that it improves stability by 10-100x, measured as the duration of holding objects, across diverse tasks. Notably, with DexGen we demonstrate, for the first time, unprecedented dexterous skills including diverse object reorientation and dexterous use of tools such as a pen, syringe, and screwdriver.

Paper Structure

This paper contains 37 sections, 8 equations, 10 figures, 3 tables, 5 algorithms.

Figures (10)

  • Figure 1: We introduce DexterityGen (DexGen) as a foundation controller that achieves unprecedented dexterous manipulation behavior with teleoperation. DexGen is a generative model that can translate an unsafe, coarse motion command produced by external policy to safe and fine actions. With human teleoperation as a high-level policy, DexGen exhibits unprecedented dexterity from diverse object rotation and regrasping to using pen, syringe, and screwdriver.
  • Figure 2: Overview of the proposed framework. Left (Training): We collect a large multi-task dexterous in-hand manipulation dataset in simulation to pretrain a generative model that can generate diverse actions conditioned on the current state. The pretrained generative model can produce useful actions including rotation, translation, and more intricate behaviors. Right (Inference): During inference, we can project dangerous motion produced by teleoperation or a policy back to a high-likelihood action with guided sampling. This makes DexGen capable of assisting a coarse high-level policy to perform complex object manipulations.
  • Figure 3: Dataset: The Anygrasp-to-Anygrasp dataset generation pipeline is designed for the generative pretraining of DexGen. For a wide variety of objects, we extensively search for potential grasp configurations, using these as both the initial and goal states for RL policies. To ensure our diffusion model can manage diverse scenarios, we incorporate varied wrist poses, movements, and domain randomization during RL training and data collection.
  • Figure 4: Model: Architecture of the DexGen controller. The whole system takes robot state, external motion conditioning, and mode conditioning as input. A diffusion model first generates the motion as the intermediate action representation. The motion conditioning is not fed into the diffusion model directly but as the gradient guidance during the diffusion sampling. Then, another inverse dynamics model will translate the generated motion to executable robot action. We implement our diffusion model as a UNet in this paper. The inverse dynamics model is a residual multilayer perceptron.
  • Figure 5: Our large-scale, multi-task pretraining dataset covers diverse grasp-to-grasp transitions (arrows). The DexGen controller learns the dataset action distribution (purple shaded area) at each state, and we can use sequential motion prompting (purple triangle) to perform a useful long-horizon skill, connecting two distant states.
  • ...and 5 more figures
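
The inference path described in the captions for Figures 2 and 4 (a diffusion model generates a motion, the external command enters only as gradient guidance during sampling, and an inverse dynamics model turns the motion into an action) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the learned prior is replaced by a narrow Gaussian with a closed-form denoiser, the guidance loss is a squared distance to the command, the sampler is a plain deterministic Euler-style loop, and the names `safe_mean`, `toy_denoiser`, `guided_sample`, `toy_inverse_dynamics`, and `PRIOR_STD` are hypothetical.

```python
import numpy as np

PRIOR_STD = 0.05  # hypothetical spread of "safe" motions around the prior mean


def safe_mean(state):
    """Hypothetical mean of the safe-motion prior for a given robot state."""
    return 0.1 * state


def toy_denoiser(x, sigma, state):
    """Stand-in for the trained diffusion model (a UNet in the paper).

    For a Gaussian prior N(mu, PRIOR_STD^2) and x = x0 + sigma * noise,
    the ideal denoiser is the posterior mean E[x0 | x], which has the
    closed form used here.
    """
    mu = safe_mean(state)
    k = PRIOR_STD**2 / (PRIOR_STD**2 + sigma**2)
    return mu + k * (x - mu)


def guided_sample(state, command, guidance_scale=0.25,
                  n_steps=50, sigma_max=1.0, seed=0):
    """Sample a motion from the toy prior, nudged toward `command`.

    The command is never an input to the denoiser; it only enters through
    the gradient of an assumed loss ||x0_hat - command||^2.
    """
    rng = np.random.default_rng(seed)
    sigmas = np.linspace(sigma_max, 0.0, n_steps + 1)
    x = sigma_max * rng.standard_normal(state.shape)  # start from noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = toy_denoiser(x, sigma, state)
        grad = 2.0 * (x0_hat - command)           # gradient of the guidance loss
        x0_hat = x0_hat - guidance_scale * grad   # guided clean-motion estimate
        # Deterministic Euler-style step from noise level sigma to sigma_next.
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
    return x


def toy_inverse_dynamics(state, motion):
    """Stand-in for the inverse dynamics model (a residual MLP in the paper).

    Assumption: the motion is a desired displacement in the same space as
    the state, so the action is simply a position target.
    """
    return state + motion


if __name__ == "__main__":
    state = np.array([0.2, -0.1, 0.0, 0.3])
    command = safe_mean(state) + 1.0  # a coarse command far from the prior
    motion = guided_sample(state, command)
    print("safe mean:", safe_mean(state))
    print("guided motion:", motion)
    print("action:", toy_inverse_dynamics(state, motion))
```

In this sketch, `guidance_scale=0.0` recovers a sample near the prior mean, while larger values pull the result toward the command. That trade-off between following the prompt and staying in the high-likelihood region is the one the Figure 2 caption describes, but the specific scale and loss here are illustrative choices only.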