UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping
Wenbo Wang, Fangyun Wei, Lei Zhou, Xi Chen, Lin Luo, Xiaohan Yi, Yizhong Zhang, Yaobo Liang, Chang Xu, Yan Lu, Jiaolong Yang, Baining Guo
TL;DR
UniGraspTransformer is a scalable, offline policy-distillation framework that collapses thousands of object-specific RL policies into a single universal Transformer for dexterous grasping. The method trains a dedicated PPO policy for each of $3{,}200$ objects, generates $M=1000$ grasp trajectories per object, and distills them into a Transformer with $K=12$ self-attention layers and a $24$-D action head trained with an $\mathcal{L}_2$ loss. It supports state-based and vision-based inputs through an S-Encoder and a V-Encoder, respectively; the vision-based variant is adapted through distillation that aligns latent representations. Relative to the prior state of the art, UniDexGrasp++, it gains up to $4.9$ (state-based) and $7.7$ (vision-based) percentage points on unseen objects, and up to $5.2$ (state-based) and $10.1$ (vision-based) percentage points on unseen categories. The results also show improved grasp diversity and generalization, supporting a simple yet effective route toward scalable, robust dexterous manipulation in real-world settings.
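To make the quantities above concrete, the following minimal sketch computes the size of the distillation set and an $\mathcal{L}_2$ action loss on a toy batch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the loss is assumed to be the squared Euclidean distance between predicted and teacher actions, averaged over the batch.

```python
# Illustrative sketch only; function names are hypothetical, not from the paper's code.
NUM_OBJECTS = 3200        # objects with a dedicated PPO policy (from the text)
TRAJ_PER_OBJECT = 1000    # M, grasp trajectories generated per object (from the text)
ACTION_DIM = 24           # width of the action head (from the text)


def num_distillation_trajectories(num_objects=NUM_OBJECTS, m=TRAJ_PER_OBJECT):
    """Total number of trajectories available for offline distillation."""
    return num_objects * m


def l2_action_loss(predicted, target):
    """Squared L2 distance between predicted and teacher actions, averaged
    over the batch. Assumption: the text only says "L2 loss", so the exact
    reduction used by the authors may differ."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and equally long")
    total = 0.0
    for p_row, t_row in zip(predicted, target):
        if len(p_row) != ACTION_DIM or len(t_row) != ACTION_DIM:
            raise ValueError("each action must have ACTION_DIM entries")
        total += sum((p - t) ** 2 for p, t in zip(p_row, t_row))
    return total / len(predicted)


if __name__ == "__main__":
    print(num_distillation_trajectories())
    # Toy batch of one action: the student predicts zeros, the teacher outputs ones.
    print(l2_action_loss([[0.0] * ACTION_DIM], [[1.0] * ACTION_DIM]))
```

Under these assumptions, the single universal network is supervised only by the recorded actions of the per-object policies, so the distillation stage itself requires no further environment interaction.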
Abstract
We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art UniDexGrasp++ across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.
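The first stage of the streamlined process, in which each dedicated policy generates successful grasp trajectories, can be sketched as a simple filter over rollouts. This is a hedged illustration only: `collect_successful`, the `rollout` callable, and the `(state, action)` trajectory layout are hypothetical names chosen here, and the abstract does not specify how failed attempts are handled.

```python
from typing import Callable, List, Tuple

# Hypothetical layout: a trajectory is a list of (state, action) pairs.
Trajectory = List[Tuple[list, list]]


def collect_successful(
    rollout: Callable[[], Tuple[Trajectory, bool]],
    target: int,
    max_attempts: int,
) -> List[Trajectory]:
    """Run a per-object policy repeatedly and keep only successful grasps.

    `rollout` stands in for one simulated episode of a dedicated policy and
    returns (trajectory, success). Collection stops once `target`
    trajectories are kept or `max_attempts` episodes have been run.
    """
    kept: List[Trajectory] = []
    for _ in range(max_attempts):
        if len(kept) >= target:
            break
        trajectory, success = rollout()
        if success:
            kept.append(trajectory)
    return kept


if __name__ == "__main__":
    attempts = {"n": 0}

    def fake_rollout():
        # Stand-in for a simulator episode: every second attempt succeeds.
        attempts["n"] += 1
        return [([0.0], [0.0])], attempts["n"] % 2 == 0

    dataset = collect_successful(fake_rollout, target=3, max_attempts=100)
    print(len(dataset), attempts["n"])
```

The kept trajectories from all objects would then be pooled into one dataset for the second stage, the distillation into a single universal network.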
