3D-TAFS: A Training-free Framework for 3D Affordance Segmentation
Meng Chu, Xuan Zhang, Zhedong Zheng, Tat-Seng Chua
TL;DR
3D-TAFS addresses the problem of translating natural-language instructions into precise 3D robotic actions, using a training-free multimodal framework that fuses 2D vision, 3D point clouds, and language understanding. It introduces IndoorAfford-Bench, a large-scale benchmark for interactive language-guided 3D affordance segmentation. Experiments on IndoorAfford-Bench show competitive performance across metrics such as $mIoU$, $AUC$, $SIM$, and $MAE$, demonstrating robust 3D affordance segmentation in diverse indoor environments. This work advances human-robot interaction by enabling intuitive language-guided manipulation without additional training.
Abstract
Translating high-level linguistic instructions into precise robotic actions in the physical world remains challenging, particularly when the feasibility of interacting with 3D objects must be taken into account. In this paper, we introduce 3D-TAFS, a novel training-free multimodal framework for 3D affordance segmentation. To facilitate comprehensive evaluation of such frameworks, we present IndoorAfford-Bench, a large-scale benchmark containing 9,248 images spanning 20 diverse indoor scenes across 6 areas, supporting standardized interaction queries. Our framework integrates a large multimodal model with a specialized 3D vision network, fusing 2D and 3D visual understanding with language comprehension. Extensive experiments on IndoorAfford-Bench validate the capability of 3D-TAFS to handle interactive 3D affordance segmentation tasks across diverse settings, with competitive performance across various metrics. Our results highlight the potential of 3D-TAFS for enhancing human-robot interaction based on affordance understanding in complex indoor environments, advancing the development of more intuitive and efficient robotic frameworks for real-world applications.
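The TL;DR names mIoU, AUC, SIM, and MAE as evaluation metrics but this section does not define them. The following is a minimal sketch of how such metrics are commonly computed for per-point affordance scores, not the authors' evaluation code. It assumes predictions and ground truth are 1D arrays of scores in [0, 1], one per point, that ground truth is binarized at 0.5 where a binary mask is needed, and that mIoU averages IoU over a set of prediction thresholds. All function names, the threshold grid, and the binarization cutoff are hypothetical choices made for illustration.

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth scores."""
    return float(np.mean(np.abs(pred - gt)))


def sim(pred, gt, eps=1e-12):
    """Histogram-intersection similarity of the two normalized score maps."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.minimum(p, q).sum())


def iou_at(pred, gt, thr, gt_thr=0.5):
    """IoU after binarizing pred at `thr` and gt at `gt_thr` (assumed cutoff)."""
    p = pred >= thr
    g = gt >= gt_thr
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(p, g).sum() / union)


def mean_iou(pred, gt, thresholds=None):
    """IoU averaged over prediction thresholds (assumed grid 0.05..0.95)."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    return float(np.mean([iou_at(pred, gt, t) for t in thresholds]))


def auc(pred, gt, gt_thr=0.5):
    """ROC AUC as the probability that a positive point outscores a negative.

    Uses an O(n_pos * n_neg) pairwise comparison, so it is only suitable
    for small point sets; ties count as half.
    """
    pos = pred[gt >= gt_thr]
    neg = pred[gt < gt_thr]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # undefined when one class is missing
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


if __name__ == "__main__":
    # Toy example with four points; values are made up for illustration.
    pred = np.array([0.9, 0.1, 0.8, 0.2])
    gt = np.array([1.0, 0.0, 0.0, 1.0])
    print("MAE ", mae(pred, gt))
    print("SIM ", sim(pred, gt))
    print("mIoU", mean_iou(pred, gt))
    print("AUC ", auc(pred, gt))
```

Higher is better for mIoU, AUC, and SIM, while lower is better for MAE. The benchmark's actual protocol (threshold grid, per-class averaging, normalization) may differ from these assumptions.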
