3D-TAFS: A Training-free Framework for 3D Affordance Segmentation

Meng Chu, Xuan Zhang, Zhedong Zheng, Tat-Seng Chua

TL;DR

3D-TAFS tackles the problem of translating natural language into precise 3D robotic actions. It is a training-free multimodal framework that fuses 2D vision, 3D point clouds, and language understanding. The paper also introduces IndoorAfford-Bench, a large-scale benchmark for interactive language-guided 3D affordance segmentation. Experiments on IndoorAfford-Bench show competitive performance on the $mIoU$, $AUC$, $SIM$, and $MAE$ metrics across diverse indoor environments. The work advances human-robot interaction by enabling intuitive language-guided manipulation without additional training.

Abstract

Translating high-level linguistic instructions into precise robotic actions in the physical world remains challenging, particularly when considering the feasibility of interacting with 3D objects. In this paper, we introduce 3D-TAFS, a novel training-free multimodal framework for 3D affordance segmentation. To facilitate a comprehensive evaluation of such frameworks, we present IndoorAfford-Bench, a large-scale benchmark containing 9,248 images spanning 20 diverse indoor scenes across 6 areas, supporting standardized interaction queries. In particular, our framework integrates a large multimodal model with a specialized 3D vision network, enabling a seamless fusion of 2D and 3D visual understanding with language comprehension. Extensive experiments on IndoorAfford-Bench validate the proposed 3D-TAFS's capability in handling interactive 3D affordance segmentation tasks across diverse settings, showcasing competitive performance across various metrics. Our results highlight 3D-TAFS's potential for enhancing human-robot interaction based on affordance understanding in complex indoor environments, advancing the development of more intuitive and efficient robotic frameworks for real-world applications.
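
The four metrics named in the TL;DR (mIoU, AUC, SIM, MAE) are not defined in this summary. The sketch below shows one common way such scores are computed for per-point affordance maps. It is a minimal illustration and not the paper's evaluation code. The function names, the 0.5 binarization threshold, and the assumption that predictions and ground truth are per-point scores in [0, 1] are all hypothetical choices made for this example.

```python
import numpy as np


def affordance_iou(pred, gt, thresh=0.5):
    """IoU after binarizing both maps at `thresh` (threshold is an assumption)."""
    p = pred >= thresh
    g = gt >= thresh
    union = np.logical_or(p, g).sum()
    if union == 0:
        # Both masks are empty: treat as a perfect match.
        return 1.0
    return float(np.logical_and(p, g).sum() / union)


def affordance_sim(pred, gt, eps=1e-12):
    """Similarity: sum of element-wise minima of the two normalized maps."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(p, g).sum())


def affordance_mae(pred, gt):
    """Mean absolute error between predicted and ground-truth scores."""
    return float(np.abs(pred - gt).mean())


def affordance_auc(pred, gt, thresh=0.5):
    """ROC AUC via pairwise comparison of positive and negative points.

    Points with gt >= thresh count as positives (assumption). Ties between
    a positive and a negative score contribute 0.5. Memory is O(n_pos * n_neg),
    which is fine for a sketch but not for very large clouds.
    """
    pos = pred[gt >= thresh]
    neg = pred[gt < thresh]
    if pos.size == 0 or neg.size == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


if __name__ == "__main__":
    # Toy per-point scores for a five-point cloud (made-up values).
    pred = np.array([0.9, 0.7, 0.4, 0.1, 0.0])
    gt = np.array([1.0, 0.6, 0.0, 0.0, 0.0])
    print(affordance_iou(pred, gt), affordance_auc(pred, gt))
    print(affordance_sim(pred, gt), affordance_mae(pred, gt))
```

A mean IoU would then average `affordance_iou` over objects (and, in some benchmarks, over several thresholds). Which averaging the paper uses is not stated here.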

Paper Structure (18 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Comparison of 2D affordance segmentation and interactive 3D affordance segmentation. While 2D segmentation offers simplicity for static image analysis, interactive 3D segmentation introduces interactivity, multimodal processing, and richer spatial understanding.
  • Figure 2: Demonstration of possible affordances in different environments. The image gives an overview of human-object interactions across four common domestic environments: kitchen, working space, living room, and bedroom. It maps specific objects in each space to their associated actions, showing how people engage with their surroundings in daily life.
  • Figure 3: Structure and workflow of 3D-TAFS. Our framework integrates vision-language processing with 3D affordance segmentation for robotic action guidance. It depicts two parallel input streams: visual input, which undergoes linear projection and multi-head attention, and textual input, which is processed through multi-head attention and feed-forward networks. These streams converge in a language model, enabling cross-modal understanding. The framework then performs object label identification to find the standard 3D point cloud. Finally, it performs the 3D affordance segmentation. This architecture demonstrates the integration of computer vision, natural language processing, and robotics in a framework capable of understanding and interacting with its environment in a human-like manner.
  • Figure 4: Dataset overview. (a) Comprehensive statistics of our dataset, including basic counts, averages, and distribution information. (b) The data collection and processing workflow of our dataset.
  • Figure 5: Visualization of room classification performance across different indoor spaces. The scatter plot displays the relationship between mIoU and SIM, with bubble sizes indicating MAE.
  • ...and 2 more figures