DexTOG: Learning Task-Oriented Dexterous Grasp with Language

Jieyi Zhang, Wenqiang Xu, Zhenjun Yu, Pengfei Xie, Tutian Tang, Cewu Lu

TL;DR

The paper introduces DexTOG, a language-guided diffusion framework for task-oriented grasping with dexterous hands. It targets two challenges: optimal grasps are multi-modal, and the configuration space has a high number of degrees of freedom (DoF). DiffuTOG generates task-aware grasp poses conditioned on 3D object observations, hand geometry, and natural language task descriptions, followed by a test-time collision refinement step. The DexTOG data engine produces the DexTOG-80K dataset by bootstrapping DiffuTOG with heuristic filtering and reinforcement learning validation across five articulated tasks on 80 objects, enabling both TOG and task-agnostic evaluation. Experimental results in simulation show improvements over baselines in both task-agnostic and task-oriented settings, with ablations highlighting the contributions of hand geometry, collision handling, and RL-driven verification. The work contributes a scalable data generation pipeline, a diffusion-based grasp model, and a dataset to advance dexterous TOG research and manipulation benchmarks.

Abstract

This study introduces a novel language-guided diffusion-based learning framework, DexTOG, aimed at advancing the field of task-oriented grasping (TOG) with dexterous hands. Unlike existing methods that mainly focus on 2-finger grippers, this research addresses the complexities of dexterous manipulation, where the system must identify non-unique optimal grasp poses under specific task constraints, cater to multiple valid grasps, and search a high degree-of-freedom configuration space during grasp planning. The proposed DexTOG includes a diffusion-based grasp pose generation model, DexDiffu, and a data engine to support DexDiffu. By leveraging DexTOG, we also propose a new dataset, DexTOG-80K, which was developed using a Shadow robot hand to perform various tasks on 80 objects from 5 categories, showcasing the dexterity and multi-tasking capabilities of the robotic hand. This research not only presents a significant leap in dexterous TOG but also provides a comprehensive dataset and simulation validation, setting a new benchmark in robotic manipulation research.

Paper Structure

This paper contains 35 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Task-oriented grasp. A task-agnostic grasp only ensures that the grasp is stable, while a task-oriented grasp must also contact the affordance part needed for the downstream task.
  • Figure 2: Pipeline. Our method contains two stages: grasp generation and grasp execution. In the generation stage, DiffuTOG generates grasp proposals, and then a test-time optimizer is used to refine the proposals. In the execution stage, we use the refined grasp pose as the initial pose and train the state-based RL to complete the task. The execution stage here is only for verification purposes.
  • Figure 3: Samples in DexTOG-80K. The object and the corresponding task-oriented grasp.
  • Figure 4: Data generation process. The generic poses, which are the task-agnostic grasp poses, are first filtered by heuristic rules. The rule-filtered grasps are then sent to DiffuTOG to increase the number of grasps. Finally, the RL policy verifies the amplified grasp poses.
  • Figure 5: Qualitative results for the generated task-oriented grasps on both seen and unseen objects.
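
The three-stage data generation process described in the Figure 4 caption (rule filtering, amplification by the generative model, RL verification) can be sketched as a simple pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Grasp` record, `run_data_engine`, the three `toy_*` stand-ins, and the task name "open" are all hypothetical, and the real stages would be a heuristic filter, a DiffuTOG sampler, and an RL rollout in simulation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Grasp:
    """Hypothetical grasp record: a flat pose vector and the task it targets."""
    pose: Tuple[float, ...]
    task: str


def run_data_engine(
    generic_grasps: Iterable[Grasp],
    rule_filter: Callable[[Grasp], bool],
    amplify: Callable[[Grasp], List[Grasp]],
    verify: Callable[[Grasp], bool],
) -> List[Grasp]:
    """Filter generic grasps, amplify the survivors, keep the verified ones."""
    # Stage 1: keep only task-agnostic grasps that pass the heuristic rules.
    seeds = [g for g in generic_grasps if rule_filter(g)]
    # Stage 2: generate additional candidates around each surviving seed.
    candidates = [c for seed in seeds for c in amplify(seed)]
    # Stage 3: keep only candidates that the verifier accepts.
    return [c for c in candidates if verify(c)]


# Toy stand-ins (assumptions for illustration only).
def toy_rule_filter(g: Grasp) -> bool:
    # Placeholder for a heuristic rule, e.g. "contacts the affordance part".
    return g.pose[0] > 0.0


def toy_amplify(g: Grasp) -> List[Grasp]:
    # Placeholder for sampling from a generative model: perturb the first value.
    return [Grasp((g.pose[0] + d,) + g.pose[1:], g.task) for d in (-0.5, 0.0, 0.5)]


def toy_verify(g: Grasp) -> bool:
    # Placeholder for "the RL policy completes the task from this grasp".
    return g.pose[0] >= 1.0


generic = [Grasp((1.0, 0.2), "open"), Grasp((-1.0, 0.3), "open")]
dataset = run_data_engine(generic, toy_rule_filter, toy_amplify, toy_verify)
print(len(dataset))
```

Passing the stages in as callables keeps the control flow independent of how each stage is realized, so a learned sampler or a simulator-backed verifier could replace the toy functions without changing the loop.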