HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models

Dehao Huang, Wenlong Dong, Chao Tang, Hong Zhang

TL;DR

HGDiffuser is a diffusion-based framework for task-oriented grasp generation that directly produces 6-DoF grasps guided by human demonstrations. It trains a Diffusion Transformer (DiT) to model a task-agnostic prior $\rho(\mathbf{H} \mid \mathbf{X}_o)$ over grasps $\mathbf{H}$ given the object point cloud $\mathbf{X}_o$, then applies guided diffusion with a differentiable task-specific loss $L(\mathbf{X}_h, \mathbf{H}, \mathbf{X}_o)$ built from the human grasp $\mathbf{X}_h$, so task-compliant grasps are generated in a single stage rather than sampled and then filtered. The model combines a VN-PointNet object encoder, a gripper-point grasp encoding, and a DiT backbone; it is trained with denoising score matching (DSM) and sampled with annealed Langevin MCMC, with explicit region and orientation constraints derived from human grasps. On OakInk, HGDiffuser outperforms two-stage baselines in both success rate and inference time (reducing latency by up to about 81%), and ablations confirm the importance of the DiT backbone. Real-world tests with a Franka arm validate practical applicability while also highlighting remaining challenges in perception and pose estimation under partial observations.
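
One common way to read "guided diffusion with a differentiable loss" is as sampling from the prior reweighted by the loss. The block below is a generic sketch of that idea, not the paper's own equation: the guidance weight $\lambda$ and the symbol $\tilde{\rho}$ are assumptions introduced here, and the gradient is written as if $\mathbf{H}$ lived in a Euclidean space, whereas actual 6-DoF grasps lie on $SE(3)$.

$$
\tilde{\rho}(\mathbf{H} \mid \mathbf{X}_o, \mathbf{X}_h) \;\propto\; \rho(\mathbf{H} \mid \mathbf{X}_o)\,\exp\!\big(-\lambda\, L(\mathbf{X}_h, \mathbf{H}, \mathbf{X}_o)\big)
$$

$$
\nabla_{\mathbf{H}} \log \tilde{\rho}(\mathbf{H} \mid \mathbf{X}_o, \mathbf{X}_h) \;=\; \nabla_{\mathbf{H}} \log \rho(\mathbf{H} \mid \mathbf{X}_o) \;-\; \lambda\, \nabla_{\mathbf{H}} L(\mathbf{X}_h, \mathbf{H}, \mathbf{X}_o)
$$

Under this reading, the learned network supplies the first term (the task-agnostic score) and the human demonstration supplies the second, which is why the loss has to be differentiable in $\mathbf{H}$.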

Abstract

Task-oriented grasping (TOG) is essential for robots to perform manipulation tasks, requiring grasps that are both stable and compliant with task-specific constraints. Humans naturally grasp objects in a task-oriented manner to facilitate subsequent manipulation tasks. By leveraging human grasp demonstrations, current methods can generate high-quality robotic parallel-jaw task-oriented grasps for diverse objects and tasks. However, they still encounter challenges in maintaining grasp stability and sampling efficiency. These methods typically rely on a two-stage process: first performing exhaustive task-agnostic grasp sampling in the 6-DoF space, then applying demonstration-induced constraints (e.g., contact regions and wrist orientations) to filter candidates. This leads to inefficiency and potential failure due to the vast sampling space. To address this, we propose the Human-guided Grasp Diffuser (HGDiffuser), a diffusion-based framework that integrates these constraints into a guided sampling process. Through this approach, HGDiffuser directly generates 6-DoF task-oriented grasps in a single stage, eliminating exhaustive task-agnostic sampling. Furthermore, by incorporating Diffusion Transformer (DiT) blocks as the feature backbone, HGDiffuser improves grasp generation quality compared to MLP-based methods. Experimental results demonstrate that our approach significantly improves the efficiency of task-oriented grasp generation, enabling more effective transfer of human grasping strategies to robotic systems. To access the source code and supplementary videos, visit https://sites.google.com/view/hgdiffuser.
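
The efficiency argument can be illustrated with a toy example. The sketch below is not the authors' implementation: it replaces the learned score network with the closed-form score of a 2D standard-normal "prior", replaces the demonstration-derived constraints with a quadratic loss around a hypothetical `target`, and works in the plane rather than in the 6-DoF grasp space. All function names, the guidance weight, and the step-size schedule are assumptions chosen for illustration. It compares how many samples satisfy the constraint when sampling from the prior and filtering versus running loss-guided annealed Langevin dynamics.

```python
import numpy as np

def task_loss(h, target):
    """Toy differentiable task loss: half squared distance to a target point."""
    return 0.5 * np.sum((h - target) ** 2, axis=-1)

def task_loss_grad(h, target):
    """Gradient of task_loss with respect to h."""
    return h - target

def prior_score(h, sigma):
    """Score of a standard-normal prior perturbed by N(0, sigma^2 I) noise.

    Stand-in for a learned noise-conditional score network (assumption).
    """
    return -h / (1.0 + sigma ** 2)

def guided_langevin(n, target, guidance=9.0, levels=10, steps=20, seed=0):
    """Annealed Langevin sampling with the loss gradient added to the score."""
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(1.0, 0.01, levels)  # decreasing noise levels
    h = rng.normal(size=(n, 2)) * np.sqrt(1.0 + sigmas[0] ** 2)
    for sigma in sigmas:
        eps = 0.1 * sigma ** 2  # step size shrinks with the noise level
        for _ in range(steps):
            drift = prior_score(h, sigma) - guidance * task_loss_grad(h, target)
            noise = rng.normal(size=h.shape)
            h = h + 0.5 * eps * drift + np.sqrt(eps) * noise
    return h

def sample_then_filter(n, target, threshold, seed=0):
    """Two-stage baseline: draw from the prior, keep samples under the threshold."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, 2))
    return h[task_loss(h, target) < threshold]

n_samples = 512
target = np.array([2.0, 0.0])  # hypothetical constraint location
threshold = 0.5                # loss below this counts as task-compliant

filtered = sample_then_filter(n_samples, target, threshold)
guided = guided_langevin(n_samples, target)

accept_two_stage = len(filtered) / n_samples
accept_guided = float(np.mean(task_loss(guided, target) < threshold))

print(f"two-stage acceptance: {accept_two_stage:.2f}")
print(f"guided acceptance:    {accept_guided:.2f}")
```

In this toy setting most prior samples fall outside the constrained region, so the filtering stage discards them, while guided sampling concentrates its samples where the loss is low. The real method faces a far larger sampling space, which is the situation the abstract describes.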

Paper Structure

This paper contains 11 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) Demonstration-based methods, which generate robotic 6-DoF parallel-jaw task-oriented grasps by leveraging human demonstrations. (b) Comparison of existing two-stage methods and our single-stage method. Unlike two-stage methods, which require extensive sampling followed by filtering to generate grasps, our method directly generates grasps with minimal sampling, making it more efficient.
  • Figure 2: Overview of our task-oriented grasping system. The task demonstrated is to hand over a cup.
  • Figure 3: An overview of HGDiffuser. The grasp generation employs annealed Langevin MCMC sampling with $T$ steps. The input object point cloud $\mathbf{X}_{o}$ is encoded into feature $\mathbf{f}^{o}$ via vision encoder, while current grasp $\mathbf{H}_{t}$ is processed into $\mathbf{f}^{g}$ via geometry encoder. These features, along with step feature $\mathbf{f}^{t}$ from sinusoidal encoding, serve as inputs to the DiT-based backbone. The fused features are decoded to produce a noise conditional score. For the input human grasp $\mathbf{X}_{h}$, explicit task-oriented constraints are extracted to construct a loss function guiding the sampling process. The noise conditional score, combined with the loss function, updates grasp $\mathbf{H}_{k}$ to $\mathbf{H}_{k-1}$, iterating $L$ times to output final grasp $\mathbf{H}_{0}$.
  • Figure 4: Qualitative results of our method and Ours-TS method. The object categories and tasks are as follows: (a) toothbrush and brushing, (b) wine glass and pouring, (c) eyeglasses and handing over, (d) scissors and using. More results are provided in the supplementary material.
  • Figure 5: Quantitative results. In the bottom-right section, we compare our method with baseline approaches in terms of average task-oriented grasping success rate and average inference time. The remaining sections present the average success rates across 24 object categories (out of 236 total object instances) from the dataset.
  • ...and 1 more figure