HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

Lei Yao, Yong Chen, Yuejiao Su, Yi Wang, Moyun Liu, Lap-Pui Chau

TL;DR

This work proposes HAMMER, a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding, and devises a hierarchical cross-modal integration mechanism that exploits the complementary information from the MLLM to refine 3D representations.

Abstract

Humans commonly identify 3D object affordances by observing interactions in images or videos, and once formed, such knowledge generalizes readily to novel objects. Inspired by this principle, we propose HAMMER, a novel framework that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, so that it thoroughly exploits object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism that exploits the complementary information from the MLLM to refine 3D representations, and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.

Paper Structure

This paper contains 35 sections, 12 equations, 10 figures, and 16 tables.

Figures (10)

  • Figure 1: Comparison of current architectures and HAMMER. (a) GREAT [shao2024great] generates texts to assist the fusion process. (b) InteractVLM [dwivedi_interactvlm_2025] first produces 2D masks and back-projects them into 3D space. (c) HAMMER enhances point cloud features with cross-modal information from MLLMs and lifts the extracted intention embedding to 3D for accurate affordance localization.
  • Figure 2: Overview of HAMMER. Given a 3D point cloud ${\mathbf{P}}$ and its corresponding interaction image ${\mathbf{I}}$, our framework first processes ${\mathbf{I}}$ through a pre-trained MLLM ${\mathcal{F}}_{\theta}$ to extract an affordance-guided intention embedding ${\bm{f}}_c$ (Sec. \ref{subsec:intention-embedding}). This embedding is then used to enhance point cloud features via a hierarchical cross-modal integration mechanism (Sec. \ref{subsec:cross-modal-feature-enhancement}). To imbue ${\bm{f}}_c$ with 3D spatial awareness, we apply a multi-granular geometry lifting module that incorporates multi-scale geometric cues (Sec. \ref{subsec:geometry-lifting}). Finally, the refined point features $\tilde{{\bm{f}}}_p$ and the 3D-aware intention embedding ${\bm{f}}_c^{3D}$ are decoded to produce the final affordance map ${\bm{p}}$ (Sec. \ref{subsec:affordance-decoding}).
  • Figure 3: Qualitative comparison on PIAD [yang2023grounding]. HAMMER generates more precise and complete affordance predictions than GREAT [shao2024great], highlighting its enhanced capability in understanding interaction intentions and reasoning about 3D affordances.
  • Figure 4: Qualitative comparison on corrupted point clouds. HAMMER localizes affordance regions more robustly under severe noise corruption, while GREAT [shao2024great] struggles to maintain reliable predictions.
  • Figure 5: Impact of MLLM backbones (left) and visualization of point-wise features via PCA (right).
  • ...and 5 more figures
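
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under strong assumptions: the point encoder and the MLLM are replaced by random features, and the function names (`enhance_points`, `lift_embedding`, `decode_affordance`, `forward_sketch`) as well as the specific operations (sigmoid gating, attention pooling, dot-product decoding) are hypothetical stand-ins chosen for brevity, not the modules defined in the paper. Only the order of the stages and the symbols $\bm{f}_c$, $\tilde{\bm{f}}_p$, $\bm{f}_c^{3D}$, and $\bm{p}$ are taken from the caption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def enhance_points(f_p, f_c):
    """Stand-in for cross-modal integration (assumed form, not the paper's).

    Each point feature is gated by its similarity to the intention
    embedding f_c and receives f_c as a residual.
    """
    d = f_p.shape[1]
    gate = sigmoid(f_p @ f_c / np.sqrt(d))          # (N,)
    return f_p + gate[:, None] * f_c[None, :]       # (N, d)


def lift_embedding(f_c, f_p_tilde):
    """Stand-in for geometry lifting (assumed form, not the paper's).

    f_c attends over the refined point features and absorbs the pooled
    result, which gives it a dependence on the 3D input.
    """
    d = f_c.shape[0]
    attn = softmax(f_p_tilde @ f_c / np.sqrt(d))    # (N,), sums to 1
    return f_c + attn @ f_p_tilde                   # (d,)


def decode_affordance(f_p_tilde, f_c_3d):
    """Assumed decoder: per-point score from a scaled dot product."""
    d = f_c_3d.shape[0]
    return sigmoid(f_p_tilde @ f_c_3d / np.sqrt(d))  # (N,) in [0, 1]


def forward_sketch(f_p, f_c):
    f_p_tilde = enhance_points(f_p, f_c)         # refined point features
    f_c_3d = lift_embedding(f_c, f_p_tilde)      # 3D-aware intention embedding
    return decode_affordance(f_p_tilde, f_c_3d)  # affordance map p


# Random placeholders for the encoder outputs (N points, d channels).
rng = np.random.default_rng(0)
f_p = rng.normal(size=(2048, 64))   # would come from a point cloud encoder
f_c = rng.normal(size=64)           # would come from the MLLM
p = forward_sketch(f_p, f_c)
```

The sketch returns one score per point, which matches the role of the affordance map $\bm{p}$ in the caption; the hierarchical and multi-granular aspects of the actual modules are deliberately omitted here.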