UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

Wenbo Wang, Fangyun Wei, Lei Zhou, Xi Chen, Lin Luo, Xiaohan Yi, Yizhong Zhang, Yaobo Liang, Chang Xu, Yan Lu, Jiaolong Yang, Baining Guo

TL;DR

UniGraspTransformer introduces a scalable offline policy-distillation framework that collapses thousands of object-specific RL policies into a single universal Transformer for dexterous grasping. The method trains a dedicated policy for each of $3{,}200$ objects with PPO, generates $M=1000$ successful grasp trajectories per object, and distills them into a Transformer with $K=12$ self-attention layers and a $24$-D action head trained with an $\mathcal{L}_2$ loss. It supports state-based and vision-based inputs via the S-Encoder and V-Encoder, respectively, and the vision-based variant uses distillation to align latent representations. Relative to the prior state-of-the-art UniDexGrasp++, it reports success-rate gains of up to $4.9$ (state-based) and $7.7$ (vision-based) percentage points on unseen objects within seen categories, and up to $5.2$ (state-based) and $10.1$ (vision-based) percentage points on objects from unseen categories. The results also show improved grasp diversity and generalization, validating a simple yet effective route to scalable, robust dexterous manipulation in real-world settings.
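The distillation objective mentioned in the TL;DR can be illustrated with a small sketch. It assumes the $\mathcal{L}_2$ loss is the squared Euclidean distance between the predicted and the expert $24$-D action, averaged over time steps; the exact reduction is not stated here, and the function name `l2_distillation_loss` is hypothetical, not taken from the authors' code.

```python
from typing import Sequence

ACTION_DIM = 24  # action dimensionality stated in the TL;DR


def l2_distillation_loss(
    predicted: Sequence[Sequence[float]],
    expert: Sequence[Sequence[float]],
) -> float:
    """Mean over steps of the squared L2 distance between action vectors.

    Assumption: averaging over steps; the source only says "L2 loss".
    """
    if len(predicted) != len(expert) or len(predicted) == 0:
        raise ValueError("predicted and expert must be non-empty and equally long")
    total = 0.0
    for pred_action, expert_action in zip(predicted, expert):
        if len(pred_action) != ACTION_DIM or len(expert_action) != ACTION_DIM:
            raise ValueError(f"each action must have {ACTION_DIM} dimensions")
        # Squared Euclidean distance for one time step.
        total += sum((p - e) ** 2 for p, e in zip(pred_action, expert_action))
    return total / len(predicted)


# Toy usage: two steps, the second prediction is off by 1.0 in every dimension.
toy_predicted = [[0.0] * ACTION_DIM, [1.0] * ACTION_DIM]
toy_expert = [[0.0] * ACTION_DIM, [0.0] * ACTION_DIM]
toy_loss = l2_distillation_loss(toy_predicted, toy_expert)
```

In a real training loop the same quantity would be computed on batched tensors and differentiated with respect to the Transformer's parameters; the pure-Python version only makes the arithmetic explicit.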

Abstract

We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art UniDexGrasp++ across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.
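The first stage described in the abstract, in which each dedicated policy generates successful grasp trajectories for later distillation, can be sketched as a simple filter over rollouts. Everything below is illustrative: `collect_successful`, the `rollout` callable, and the `max_attempts` cap are hypothetical names and assumptions, and the toy rollout stands in for a simulator episode driven by a trained policy.

```python
from typing import Callable, List, Tuple

# A trajectory is left abstract here: a list of per-step records.
Trajectory = List[int]


def collect_successful(
    rollout: Callable[[], Tuple[Trajectory, bool]],
    m: int,
    max_attempts: int,
) -> List[Trajectory]:
    """Run rollouts until m successful trajectories are kept or attempts run out."""
    kept: List[Trajectory] = []
    for _ in range(max_attempts):
        if len(kept) == m:
            break
        trajectory, success = rollout()
        if success:  # failed grasps are discarded, only successes are distilled
            kept.append(trajectory)
    return kept


def make_toy_rollout() -> Callable[[], Tuple[Trajectory, bool]]:
    """Deterministic stand-in for a policy rollout: every second attempt succeeds."""
    state = {"attempt": 0}

    def rollout() -> Tuple[Trajectory, bool]:
        state["attempt"] += 1
        return [state["attempt"]], state["attempt"] % 2 == 0

    return rollout


toy_dataset = collect_successful(make_toy_rollout(), m=3, max_attempts=100)
```

Repeating this per object and concatenating the results would give the trajectory set used for supervised training of the universal network.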

Paper Structure

This paper contains 20 sections, 6 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Performance comparison among UniDexGrasp, UniDexGrasp++ and our UniGraspTransformer, across state-based and vision-based settings. For each setting, success rates are evaluated on seen objects, unseen objects within seen categories, and entirely unseen objects from unseen categories.
  • Figure 2: Overview of UniGraspTransformer. (a) Dedicated policy network training: each individual RL policy network is trained to grasp a specific object with various initial poses. (b) Grasp trajectory generation: each policy network generates $M$ successful grasp trajectories, forming a trajectory set $\mathcal{D}$. (c) UniGraspTransformer training: trajectories from $\mathcal{D}$ are used to train UniGraspTransformer, a universal grasp network, in a supervised manner. We investigate two settings, state-based and vision-based, with the primary difference being in the input representation of object state and hand-object distance, as indicated by "*" in the figure. The architecture of S-Encoder and V-Encoder can be found in Figure 3.
  • Figure 3: Illustration of the network architecture of the object point cloud encoder, S-Encoder, in the state-based setting. The process begins with sampling 1,024 points from the object point cloud, producing an input with a dimension of $1024 \times 3$. This input is passed through the encoder, producing a 128-dimensional object feature, which the decoder then uses to reconstruct the 1,024 sampled points, with the Chamfer Distance serving as the loss function. During inference, only the encoder is used to convert an object point cloud into a 128-dimensional object feature.
  • Figure 4: Quantitative analysis of grasp pose diversity.
  • Figure 5: Comparison of grasp poses generated by the state-based universal model from UniDexGrasp++ (top row) and our UniGraspTransformer (bottom row). Each column displays two distinct grasp poses for the same object with the same initial pose.
  • ...and 6 more figures
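
The Figure 3 caption names the Chamfer Distance as the reconstruction loss of the S-Encoder autoencoder. A minimal sketch of one common form follows, assuming the symmetric variant that sums the mean squared nearest-neighbor distance in both directions; the caption does not say which variant the authors use, and the brute-force loop is for illustration only (it is quadratic in the number of points).

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3-D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_way_chamfer(source: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean over source points of the squared distance to the nearest target point."""
    return sum(min(squared_distance(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance (assumed variant: sum of both directions)."""
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    return one_way_chamfer(a, b) + one_way_chamfer(b, a)


# Toy usage with two points per cloud instead of the 1,024 sampled in the paper.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
toy_chamfer = chamfer_distance(cloud_a, cloud_b)
```

In the setting of Figure 3, `a` would be the 1,024 sampled input points and `b` the 1,024 points reconstructed by the decoder from the 128-dimensional feature.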