AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers

Nghia Vu, Tuong Do, Khang Nguyen, Baoru Huang, Nhat Le, Binh Xuan Nguyen, Erman Tjiputra, Quang D. Tran, Ravi Prakash, Te-Chuan Chiu, Anh Nguyen

Abstract

Affordance learning is a complex challenge in many applications: existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to an entire scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward. In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within the scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that establishes coherent semantic correspondences between image-based and point cloud-based instances for keypoint matching, enabling more precise identification of affordance regions based on visual cues, which we refer to as visual signifiers. Experimental results on our dataset demonstrate the effectiveness of our approach compared to other methods.

Paper Structure

This paper contains 19 sections, 11 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Overview of AffordMatcher: Detecting and localizing affordances in 3D voxelized scenes through visual signifiers relies on semantic context drawn from RGB images. Given a scene representation and visual signifiers, AffordMatcher can understand actionable commands, such as "watch the television", "push the tip", "rotate pull", or "open the chimney", and identify the corresponding spatial affordances.
  • Figure 2: Construction of the AffordBridge dataset: Our AffordBridge dataset is built through a semi-supervised pipeline linking visual signifiers with 3D affordances. The building process includes (i) 3D scene processing via voxelized point clouds with object-view filtering through visual scanning, (ii) visual signifier processing with human-object interaction extraction and fine-grained captioning, and (iii) affordance annotation by matching key views to 3D instances for spatial action labeling.
  • Figure 3: Dataset statistics: Statistics of objects involved in human-object interactions that yield affordances in our AffordBridge dataset.
  • Figure 4: Design architecture of AffordMatcher: Given a high-resolution voxelized scene point cloud and a visual signifier, AffordMatcher reasons over these inputs for zero-shot affordance segmentation. The affordance extractor identifies 3D interactable regions, while the reasoning extractor encodes 2D human-object cues. Cross-modal alignment is achieved via instance matching through a dissimilarity matrix (a minimal sketch of this matching step follows the figure list). The features from the dissimilarity matrix are then refined through match-to-match attention, followed by zero-shot affordance optimization to localize actionable spatial regions that align with the given signifier.
  • Figure 5: Attention visualization: From the visual signifier in the RGB image and the text "Rest on Pillow", AffordMatcher focuses on the pillow area in the RGB image and correctly localizes the corresponding affordance regions in the high-resolution voxelized indoor scene.
  • ...and 7 more figures
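
Cross-modal instance matching (illustrative sketch)

The Figure 4 caption describes aligning 2D instances from the visual signifier with 3D instances from the scene point cloud through a dissimilarity matrix. The snippet below is a minimal sketch of that matching step only, not the authors' implementation: it builds a cosine dissimilarity matrix between pre-computed instance embeddings and pairs instances with a Hungarian assignment. The embedding sources, feature dimension, and all function and variable names (cosine_dissimilarity, match_instances, image_feats, point_feats) are assumptions for illustration, and the paper's match-to-match attention and zero-shot affordance optimization stages are not reproduced here.

```python
# Hypothetical sketch of dissimilarity-matrix instance matching between
# 2D (image) and 3D (point cloud) instance embeddings.
import numpy as np
from scipy.optimize import linear_sum_assignment


def cosine_dissimilarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine dissimilarity between rows of a (N, D) and b (M, D)."""
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return 1.0 - a_norm @ b_norm.T  # (N, M); 0 means identical direction


def match_instances(image_feats: np.ndarray, point_feats: np.ndarray):
    """Pair 2D (image) and 3D (point cloud) instance embeddings.

    Returns index pairs (i, j) minimizing total dissimilarity, plus the full
    dissimilarity matrix that a downstream refinement module could consume.
    """
    dissim = cosine_dissimilarity(image_feats, point_feats)
    rows, cols = linear_sum_assignment(dissim)  # Hungarian assignment
    return list(zip(rows.tolist(), cols.tolist())), dissim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feats = rng.normal(size=(4, 256))   # e.g. 4 detected 2D instances
    point_feats = rng.normal(size=(6, 256))   # e.g. 6 candidate 3D instances
    pairs, dissim = match_instances(image_feats, point_feats)
    print("matched (2D idx, 3D idx):", pairs)
    print("dissimilarity matrix shape:", dissim.shape)
```

In this sketch the assignment is one-shot; in the paper's pipeline the dissimilarity features are further refined by match-to-match attention before affordance regions are localized.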