AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis

Xiaofei Wu; Yi Zhang; Yumeng Liu; Yuexin Ma; Yujiao Shi; Xuming He

TL;DR

This work introduces a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent, and presents AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision.

Abstract

Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand-object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.

Paper Structure (33 sections, 32 equations, 19 figures, 10 tables)

Introduction
Related Work
Grasp Synthesis.
Affordance in Hand-Object Interaction.
Denoising Diffusion Probabilistic Models.
Approach - AffordGrasp
Affordance Generator
Text and Affordance Guided Grasp Generation
Distribution Adjustment Module
Inference
Experiment
Automated Labeling for Dataset Enrichment
Evaluation Metrics
Grasp Generation Performance
Ablation Study
...and 18 more sections

Figures (19)

Figure 1: Overview of AffordGrasp. We integrate language instructions with object point clouds and employ the Affordance Generator to predict point-wise confidence features, which are aggregated into the final affordance map to enhance spatial detail and align linguistic semantics with 3D structures. The right part employs the DAM, which ensures that the grasping poses synthesized by the LDM align with physical constraints and language semantics.
Figure 2: Distribution Adjustment Module (DAM) Architecture. Hand and object features are fused and aligned with language instructions to produce stable, instruction-consistent grasps.
Figure 3: Affordance Annotation. An automated self-training pipeline first assigns pseudo-labels to unlabeled data, then iteratively optimizes the model using these refined annotations.
Figure 5: Simulation environment: grasping a single object under different instructions.
Figure 6: Simulation environment: grasping across multiple objects.
...and 14 more figures
