HGDiffuser: Efficient Task-Oriented Grasp Generation via Human-Guided Grasp Diffusion Models

Dehao Huang; Wenlong Dong; Chao Tang; Hong Zhang

TL;DR

HGDiffuser is a diffusion-based framework for task-oriented grasp generation that directly produces 6-DoF grasps guided by human demonstrations. A Diffusion Transformer (DiT) is trained to model a task-agnostic grasp prior $\rho(\mathbf{H} \mid \mathbf{X}_o)$. At inference, guided diffusion with a differentiable task-specific loss $L(\mathbf{X}_h, \mathbf{H}, \mathbf{X}_o)$ steers sampling toward grasps that satisfy the demonstration-derived constraints, so compliant grasps are generated in a single stage rather than by sampling and then filtering. The architecture combines a VN-PointNet object encoder, a gripper-point grasp encoding, and a DiT backbone. Training uses denoising score matching (DSM), inference uses annealed Langevin MCMC, and the guidance loss encodes explicit region and orientation constraints derived from human grasps. On OakInk, HGDiffuser outperforms two-stage baselines in both success rate and inference time, reducing latency by up to roughly 81%, and ablations confirm the importance of the DiT backbone. Real-world tests with a Franka arm validate practical applicability, while also highlighting remaining challenges in perception and pose estimation under partial observations.
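The guided sampling idea summarized above can be illustrated with a minimal toy sketch. This is not the authors' implementation. It assumes a 2-D Euclidean state instead of SE(3) grasp poses, an analytic Gaussian prior in place of the learned DiT score model, and a quadratic stand-in for the task loss. All names and parameter values (`guided_annealed_langevin`, `guidance_weight`, the noise schedule) are hypothetical.

```python
import numpy as np

def guided_annealed_langevin(prior_mean, prior_std, task_target, guidance_weight,
                             sigmas, steps_per_level=50, eps=1e-3,
                             n_samples=2000, seed=0):
    """Toy guided annealed Langevin sampler in R^2 (assumption: not SE(3))."""
    rng = np.random.default_rng(seed)
    prior_mean = np.asarray(prior_mean, dtype=float)
    task_target = np.asarray(task_target, dtype=float)

    # Start from noise at the largest noise level.
    x = rng.normal(size=(n_samples, prior_mean.size)) * sigmas[0]

    for sigma in sigmas:  # anneal from large to small noise
        # Step size scaled with the noise level, as in annealed Langevin dynamics.
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            # Score of the Gaussian prior after perturbation with noise sigma.
            # In a learned model this would come from the trained network.
            prior_score = -(x - prior_mean) / (prior_std**2 + sigma**2)
            # Gradient of the stand-in task loss 0.5 * w * ||x - target||^2.
            loss_grad = guidance_weight * (x - task_target)
            noise = rng.normal(size=x.shape)
            # Guided update: follow the prior score, descend the task loss.
            x = x + 0.5 * alpha * (prior_score - loss_grad) + np.sqrt(alpha) * noise
    return x

sigmas = np.geomspace(0.5, 0.05, 10)  # hypothetical noise schedule
guided = guided_annealed_langevin([0.0, 0.0], 1.0, [1.0, 0.0], 4.0, sigmas)
unguided = guided_annealed_langevin([0.0, 0.0], 1.0, [1.0, 0.0], 0.0, sigmas)

print("guided mean:  ", guided.mean(axis=0))
print("unguided mean:", unguided.mean(axis=0))
```

With the guidance weight set to zero, the samples stay centered on the prior mean. With a positive weight they shift toward the task target while still being drawn in a single sampling pass, which is the behavior the guided formulation relies on.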

Abstract

Task-oriented grasping (TOG) is essential for robots to perform manipulation tasks, requiring grasps that are both stable and compliant with task-specific constraints. Humans naturally grasp objects in a task-oriented manner to facilitate subsequent manipulation tasks. By leveraging human grasp demonstrations, current methods can generate high-quality robotic parallel-jaw task-oriented grasps for diverse objects and tasks. However, they still encounter challenges in maintaining grasp stability and sampling efficiency. These methods typically rely on a two-stage process: first performing exhaustive task-agnostic grasp sampling in the 6-DoF space, then applying demonstration-induced constraints (e.g., contact regions and wrist orientations) to filter candidates. This leads to inefficiency and potential failure due to the vast sampling space. To address this, we propose the Human-guided Grasp Diffuser (HGDiffuser), a diffusion-based framework that integrates these constraints into a guided sampling process. Through this approach, HGDiffuser directly generates 6-DoF task-oriented grasps in a single stage, eliminating exhaustive task-agnostic sampling. Furthermore, by incorporating Diffusion Transformer (DiT) blocks as the feature backbone, HGDiffuser improves grasp generation quality compared to MLP-based methods. Experimental results demonstrate that our approach significantly improves the efficiency of task-oriented grasp generation, enabling more effective transfer of human grasping strategies to robotic systems. To access the source code and supplementary videos, visit https://sites.google.com/view/hgdiffuser.
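To make the inefficiency of the two-stage process concrete, the following rough sketch estimates how few task-agnostic candidates survive a region constraint and an orientation constraint. It is an illustrative assumption only: positions are drawn uniformly in a unit cube, approach directions uniformly on the sphere, and the region radius and angle threshold are hypothetical values, not figures from the paper.

```python
import numpy as np

def two_stage_acceptance_rate(n_candidates=200_000, region_radius=0.2,
                              max_angle_deg=30.0, seed=0):
    """Fraction of uniformly sampled candidates that pass both filters."""
    rng = np.random.default_rng(seed)

    # Stage 1 (task-agnostic): positions in the unit cube, random unit directions.
    positions = rng.uniform(0.0, 1.0, size=(n_candidates, 3))
    directions = rng.normal(size=(n_candidates, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Stage 2 (filtering): hypothetical contact region and wrist direction.
    region_center = np.array([0.5, 0.5, 0.5])
    wrist_direction = np.array([0.0, 0.0, 1.0])
    in_region = np.linalg.norm(positions - region_center, axis=1) <= region_radius
    aligned = directions @ wrist_direction >= np.cos(np.deg2rad(max_angle_deg))

    return float(np.mean(in_region & aligned))

rate = two_stage_acceptance_rate()
# Closed form for this toy setup: ball volume times spherical-cap fraction.
expected = (4.0 / 3.0 * np.pi * 0.2**3) * (1.0 - np.cos(np.deg2rad(30.0))) / 2.0
print(f"empirical acceptance: {rate:.5f}, closed form: {expected:.5f}")
```

In this toy setting only about 0.2% of candidates pass both filters, so most of the sampling effort is discarded. A guided sampler avoids that waste by steering every sample toward the constraints during generation.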
