Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

Xianqiang Gao, Pingrui Zhang, Delin Qu, Dong Wang, Zhigang Wang, Yan Ding, Bin Zhao

TL;DR

The paper tackles 3D Object Affordance Grounding by addressing generalization gaps when learning from a single reference image. It introduces MIFAG, which learns invariant affordance knowledge from multiple human-object interaction images via the Invariant Affordance Knowledge Extraction Module (IAM) and fuses this knowledge with 3D point clouds through the Affordance Dictionary Adaptive Fusion Module (ADM). A new Multi-Image and Point Affordance (MIPA) dataset is constructed to benchmark cross-image and point-cloud grounding. Experiments show state-of-the-art performance on seen and unseen data and demonstrate robust real-world generalization with LiDAR- and camera-based inputs, highlighting the practical potential for robotics and embodied perception.

Abstract

3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. Besides, the Multi-Image and Point Affordance (MIPA) benchmark is constructed and our method outperforms existing state-of-the-art methods on various experimental comparisons. Project page: https://goxq.github.io/mifag
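
The two-stage idea in the abstract (iteratively building an affordance dictionary from several images, then fusing it with point features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the residual update rule, the absence of learned projections, and all dimensions are assumptions chosen for illustration, and random arrays stand in for real image and point-cloud features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; a real model would use learned Q/K/V weights).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

def update_dictionary(dictionary, image_feats_list):
    # Hypothetical iterative update: dictionary tokens attend to each image
    # in turn and accumulate what they find through a residual connection.
    for feats in image_feats_list:
        attended, _ = cross_attention(dictionary, feats)
        dictionary = dictionary + attended
    return dictionary

def fuse_points_with_dictionary(point_feats, dictionary):
    # Hypothetical fusion: each point queries the dictionary tokens.
    attended, weights = cross_attention(point_feats, dictionary)
    return point_feats + attended, weights

# Assumed sizes, for illustration only.
rng = np.random.default_rng(0)
num_images, num_patches, dim = 4, 16, 32
num_tokens, num_points = 8, 2048

image_feats = [rng.normal(size=(num_patches, dim)) for _ in range(num_images)]
dictionary = rng.normal(size=(num_tokens, dim))
point_feats = rng.normal(size=(num_points, dim))

dictionary = update_dictionary(dictionary, image_feats)
fused, weights = fuse_points_with_dictionary(point_feats, dictionary)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one point attends to each dictionary token. In a full model, a per-point prediction head would map `fused` to affordance scores.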

Paper Structure (45 sections, 12 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Motivation of Our Method. The reference human-object images exhibit significant variations in appearance, yet they consistently imply the same affordance knowledge. We propose to iteratively extract the invariant affordance knowledge from multiple images, leading to improved performance.
  • Figure 2: Overview of our proposed MIFAG. (a) The IAM utilizes a multi-layer network and a dual-branch structure to gradually extract invariant affordance knowledge and minimize interference caused by appearance variations in the images. (b) The ADM leverages the invariant affordance knowledge dictionary derived from (a), using dictionary-based cross attention and self-weighted attention to comprehensively fuse the affordance knowledge with point cloud representations.
  • Figure 3: Affordance Visualization on MIPA dataset. Compared with LASO [li2024laso] and IAGNet [yang2023grounding], the proposed MIFAG achieves more accurate results in both seen and unseen settings.
  • Figure 4: t-SNE visualization of affordance queries. Query tokens corresponding to the same operation across different objects cluster in the same region.
  • Figure 5: Real-World Visualization. Left: Original 3D point clouds scanned by an iPhone 15 Pro. Middle: Reference images. Right: Affordance prediction results on the scanned point cloud.
  • ...and 4 more figures