Learning Dexterous Grasping from Sparse Taxonomy Guidance

Juhan Park, Taerim Yoon, Seungmin Kim, Joonggil Kim, Wontae Ye, Jeongeun Park, Yoonbyung Chai, Geonwoo Cho, Geunwoo Cho, Dohyeong Kim, Kyungjae Lee, Yongjae Kim, Sungjoon Choi

Abstract

Dexterous manipulation requires planning a grasp configuration suited to the object and task, which is then executed through coordinated multi-finger control. However, specifying grasp plans with dense pose or contact targets for every object and task is impractical. Meanwhile, end-to-end reinforcement learning from task rewards alone lacks controllability, making it difficult for users to intervene when failures occur. To address both issues, we present GRIT, a two-stage framework that learns dexterous control from sparse taxonomy guidance. GRIT first predicts a taxonomy-based grasp specification from the scene and task context. Conditioned on this sparse command, a policy generates continuous finger motions that accomplish the task while preserving the intended grasp structure. Our results show that certain grasp taxonomies are more effective for specific object geometries. By leveraging this relationship, GRIT improves generalization to novel objects over baselines and achieves an overall success rate of 87.9%. Moreover, real-world experiments demonstrate controllability, enabling grasp strategies to be adjusted through high-level taxonomy selection based on object geometry and task intent.
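
To make the two-stage structure concrete, the sketch below traces one inference step: a high-level module selects a grasp taxonomy from the scene and task description, and a taxonomy-conditioned policy maps that sparse command plus partial observations to continuous finger actions. All names, dimensions, and the taxonomy set are illustrative assumptions, not the authors' actual interface.

```python
# Hypothetical sketch of a two-stage, taxonomy-conditioned grasping pipeline.
# Names, dimensions, and the taxonomy set are assumptions for illustration.
from dataclasses import dataclass
from enum import Enum, auto
import numpy as np


class GraspTaxonomy(Enum):
    """A small set of canonical grasp types; the actual library may differ."""
    POWER_SPHERE = auto()
    PRECISION_PINCH = auto()
    LATERAL = auto()
    TRIPOD = auto()


@dataclass
class Observation:
    """Partial observation available to the low-level policy."""
    proprioception: np.ndarray   # hand joint positions/velocities
    point_cloud: np.ndarray      # partial object point cloud from depth


def select_taxonomy(scene_image: np.ndarray, task_text: str) -> GraspTaxonomy:
    """Stage 1: a high-level module (e.g. a vision-language model) maps the
    scene and task description to a sparse taxonomy command.
    Placeholder choice here."""
    return GraspTaxonomy.POWER_SPHERE


def taxonomy_conditioned_policy(obs: Observation,
                                taxonomy: GraspTaxonomy) -> np.ndarray:
    """Stage 2: a learned policy maps observations plus a one-hot taxonomy
    command to continuous finger joint targets (placeholder: zeros)."""
    one_hot = np.eye(len(GraspTaxonomy))[taxonomy.value - 1]
    _ = np.concatenate([obs.proprioception, one_hot])  # policy input
    return np.zeros(22)  # e.g. a 22-DoF hand action (assumed)


if __name__ == "__main__":
    taxonomy = select_taxonomy(np.zeros((224, 224, 3)), "hand me the mug")
    obs = Observation(proprioception=np.zeros(44), point_cloud=np.zeros((1024, 3)))
    action = taxonomy_conditioned_policy(obs, taxonomy)
    print(taxonomy, action.shape)
```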

Paper Structure

This paper contains 29 sections, 8 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure A1: Our framework selects appropriate grasp taxonomies based on object geometry and task context, and executes them through a taxonomy-conditioned control policy.
  • Figure C1: The proposed framework consists of three main components. 1) A taxonomy library providing canonical grasp templates defined by reference hand configurations and contact structures. 2) A grasp generation module that randomly samples a taxonomy during training, while at inference, a vision–language model selects one from the scene and task description. 3) A taxonomy-conditioned control policy learned via teacher–student distillation, where the teacher uses privileged information and the student relies on action–state history and partial visual observations (a minimal distillation sketch follows this list).
  • Figure E1: The object pose is fixed while the wrist orientation is sampled from eight directions around the object. For each direction, 30 trials are performed with randomized initial wrist positions.
  • Figure E2: Performance comparison according to the mimic reward weight $w_{\mathrm{mimic}}$. Ours maintains superior performance without hyperparameter tuning. $^*$ Methods retrained with the same taxonomy condition and rewards for a fair comparison.
  • Figure E3: Qualitative examples of generated grasps in simulation and real-world settings.
  • ...and 1 more figure
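
The Figure C1 caption describes the control policy being learned via teacher–student distillation, with the teacher using privileged information and the student relying on action–state history and partial visual observations. The sketch below shows one plausible distillation step in PyTorch; the dimensions, network architectures, and MSE imitation loss are assumptions made for illustration, not the authors' exact training setup.

```python
# Minimal sketch of a teacher-student distillation step for a
# taxonomy-conditioned policy. All dimensions and losses are assumed.
import torch
import torch.nn as nn

PRIV_DIM, OBS_DIM, HIST_LEN, TAX_DIM, ACT_DIM = 64, 128, 10, 8, 22

# Teacher sees privileged simulation state (e.g. object pose, contacts).
teacher = nn.Sequential(
    nn.Linear(PRIV_DIM + TAX_DIM, 256), nn.ELU(), nn.Linear(256, ACT_DIM))
# Student sees only an action-state history plus the taxonomy command.
student = nn.Sequential(
    nn.Linear(OBS_DIM * HIST_LEN + TAX_DIM, 256), nn.ELU(), nn.Linear(256, ACT_DIM))

optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)


def distill_step(priv_state, obs_history, taxonomy_onehot):
    """One distillation step: the student imitates the frozen teacher's action."""
    with torch.no_grad():
        target_action = teacher(torch.cat([priv_state, taxonomy_onehot], dim=-1))
    student_action = student(
        torch.cat([obs_history.flatten(1), taxonomy_onehot], dim=-1))
    loss = nn.functional.mse_loss(student_action, target_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example call on a random batch of 32 transitions.
loss = distill_step(torch.randn(32, PRIV_DIM),
                    torch.randn(32, HIST_LEN, OBS_DIM),
                    torch.zeros(32, TAX_DIM))
print(f"distillation loss: {loss:.4f}")
```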