DextrAH-RGB: Visuomotor Policies to Grasp Anything with Dexterous Hands

Ritvik Singh, Arthur Allshire, Ankur Handa, Nathan Ratliff, Karl Van Wyk

TL;DR

DextrAH-RGB advances end-to-end RGB-based visuomotor control for dexterous grasping by training a privileged state-based Fabric-Guided Policy (FGP) in simulation and distilling it into a stereo RGB-based policy via online DAgger and photorealistic rendering. The approach leverages geometric fabrics to enforce safe, reactive behavior and uses a cross-attention transformer to extract depth cues from stereo RGB inputs. Real-world experiments with a Kuka iiwa and Allegro Hand demonstrate competitive sim-to-real transfer across unseen objects and lighting conditions, though transfer variability and training complexity remain challenges. Overall, the work establishes a scalable path for RGB-driven dexterous manipulation trained in simulation with robust real-world performance, paving the way for multi-object and more dexterous capabilities.

Abstract

One of the most important, yet challenging, skills for a dexterous robot is grasping a diverse range of objects. Much of the prior work has been limited by speed, generality, or reliance on depth maps and object poses. In this paper, we introduce DextrAH-RGB, a system that can perform dexterous arm-hand grasping end-to-end from RGB image input. We train a privileged fabric-guided policy (FGP) in simulation through reinforcement learning that acts on a geometric fabric controller to dexterously grasp a wide variety of objects. We then distill this privileged FGP into an RGB-based FGP strictly in simulation using photorealistic tiled rendering. To our knowledge, this is the first work to demonstrate robust sim-to-real transfer of an end-to-end RGB-based policy for complex, dynamic, contact-rich tasks such as dexterous grasping. DextrAH-RGB is competitive with depth-based dexterous grasping policies, and generalizes to novel objects with unseen geometry, texture, and lighting conditions in the real world. Videos of our system grasping a diverse range of unseen objects are available at https://dextrah-rgb.github.io/.

Paper Structure

This paper contains 13 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: DextrAH-RGB (Dexterous Arm-Hand RGB) is an end-to-end RGB-based policy that can dexterously grasp a wide variety of objects.
  • Figure 2: The first stage of our pipeline trains a state-based teacher policy in simulation using PPO. We adopt an asymmetric actor-critic framework in which the teacher policy receives noisy state observations, whereas the critic receives privileged (noise-free) state observations. This keeps the policy from relying on behaviors that require accurate state estimates, which would make it harder to distill into a vision-based policy. The teacher policy uses an LSTM layer to reason over historical context, enabling adaptation to the current dynamics. We add dense skip connections, similar to DextrAH-G (Lum et al., 2024), to improve the stability and performance of the policy.
  • Figure 3: The second stage of our pipeline distills the previously trained state-based teacher policy into a vision-based student policy. We use an online implementation of DAgger (Ross et al., 2011), where at each step the observations for the teacher and student are queried and fed into the respective networks. The student is supervised to minimize the KL-divergence between its action distribution and that of the teacher. Furthermore, we add an auxiliary loss for predicting the 3D object position. The student architecture consists of an encoder that takes as input images from the left and right cameras and outputs a stereo embedding, which is then concatenated with the standard robot proprioceptive observations and passed into an LSTM followed by an MLP. As in the teacher, dense connections are used to improve the performance of the policy.
  • Figure 4: For every physics parameter $p^\texttt{i}$, the range bounds $p^\texttt{i\_lo}$ and $p^\texttt{i\_hi}$ are both initialized to $p^\texttt{i}_{\texttt{init}}$. As the policy's performance improves, $p^\texttt{i\_lo}$ is decremented by $\Delta^n$ and $p^\texttt{i\_hi}$ is incremented by $\Delta^n$. The parameter ranges keep widening until they reach the terminal values $p^\texttt{i\_lo}_{\texttt{terminal}}$ and $p^\texttt{i\_hi}_{\texttt{terminal}}$.
  • Figure 5: (a) shows an example subset of object meshes with no texture; (b) shows the same meshes with random textures bound to them.
  • ...and 3 more figures
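
The student objective described in the Figure 3 caption (a KL term against the teacher's action distribution plus an auxiliary 3D object position loss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes diagonal Gaussian action distributions, takes the KL in the direction KL(teacher || student), uses a summed squared error for the position term, and introduces a hypothetical `aux_weight` parameter. None of these choices are stated in the captions.

```python
import math
from typing import Sequence


def gaussian_kl(mu_p: Sequence[float], std_p: Sequence[float],
                mu_q: Sequence[float], std_q: Sequence[float]) -> float:
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def distillation_loss(teacher_mu: Sequence[float], teacher_std: Sequence[float],
                      student_mu: Sequence[float], student_std: Sequence[float],
                      pred_pos: Sequence[float], true_pos: Sequence[float],
                      aux_weight: float = 1.0) -> float:
    """Imitation term plus auxiliary object-position term for one time step.

    Assumptions (not from the source): KL direction is teacher -> student,
    the position loss is a summed squared error, and aux_weight is a
    hypothetical scalar that balances the two terms.
    """
    kl = gaussian_kl(teacher_mu, teacher_std, student_mu, student_std)
    aux = sum((p - t) ** 2 for p, t in zip(pred_pos, true_pos))
    return kl + aux_weight * aux
```

When the student matches the teacher exactly and predicts the object position perfectly, both terms vanish and the loss is zero. In practice the same quantities would be computed on batched tensors in an autodiff framework so that gradients flow into the student network.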
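
The range expansion described in the Figure 4 caption can be sketched as a small curriculum update. This is an illustrative sketch under stated assumptions: the class and function names are hypothetical, a single fixed step `delta` stands in for $\Delta^n$, and "performing better" is modelled as a success rate crossing a hypothetical threshold, which the caption does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class ParamRange:
    """Sampling range [lo, hi] for one physics parameter (hypothetical names)."""
    init: float
    lo_terminal: float
    hi_terminal: float
    delta: float  # assumed fixed step size, standing in for the caption's Delta^n
    lo: float = field(init=False)
    hi: float = field(init=False)

    def __post_init__(self) -> None:
        # Both bounds start at the initial value, so the range has zero width.
        self.lo = self.init
        self.hi = self.init

    def expand(self) -> None:
        """Widen the range by one step, clamped to the terminal bounds."""
        self.lo = max(self.lo - self.delta, self.lo_terminal)
        self.hi = min(self.hi + self.delta, self.hi_terminal)

    def done(self) -> bool:
        """True once both bounds have reached their terminal values."""
        return self.lo == self.lo_terminal and self.hi == self.hi_terminal


def update_curriculum(rng: ParamRange, success_rate: float,
                      threshold: float = 0.8) -> None:
    # Assumption: the range widens only when the success rate meets the threshold.
    if success_rate >= threshold:
        rng.expand()
```

For example, a friction range created with `ParamRange(init=1.0, lo_terminal=0.5, hi_terminal=1.5, delta=0.25)` starts as the single value 1.0, widens to [0.75, 1.25] after one successful evaluation, and stops growing once it reaches [0.5, 1.5].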