3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds

Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye Hao, Liqiang Nie

TL;DR

This work tackles open-world 3D affordance detection by reframing the problem as Instruction Reasoning Affordance Segmentation (IRAS) and introducing 3D-ADLLM, a framework that injects large language model reasoning into 3D perception through an Affordance Decoder. A two-stage training regime—Referring Object Part Segmentation (ROPS) pretraining on PartNet to capture general object-part segmentation knowledge, followed by IRAS finetuning with an end-to-end loss that combines text and mask supervision—enables effective transfer from general segmentation to affordance-specific reasoning. Empirical results show that 3D-ADLLM achieves substantial gains over state-of-the-art baselines in open-vocabulary and zero-shot settings, including notable improvements in mIoU and mAP50 on full-view and partial-view tasks, as well as strong out-of-distribution generalization on AffordPose. The approach demonstrates the practical potential of integrating LLMs with 3D perception for flexible, instruction-driven interaction in unseen scenes, suggesting a promising direction for embodied AI and robotics.
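The "end-to-end loss that combines text and mask supervision" can be illustrated with a small sketch. The summary above does not give the exact terms or weights, so everything here is an assumption: the text term is taken to be a token-level negative log-likelihood, the mask term a per-point binary cross-entropy plus a Dice loss, and the names `combined_loss`, `w_txt`, `w_bce`, and `w_dice` are hypothetical.

```python
import math


def token_nll(token_probs):
    """Mean negative log-likelihood of the ground-truth text tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-point BCE between predicted probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, eps=1e-7):
    """1 - Dice coefficient between the soft prediction and the 0/1 mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def combined_loss(token_probs, mask_probs, mask_targets,
                  w_txt=1.0, w_bce=1.0, w_dice=1.0):
    """Weighted sum of a text term and two mask terms.

    The weights are hypothetical placeholders, not values from the paper.
    """
    text_term = token_nll(token_probs)
    mask_term = (w_bce * binary_cross_entropy(mask_probs, mask_targets)
                 + w_dice * dice_loss(mask_probs, mask_targets))
    return w_txt * text_term + mask_term


# Toy example: four points, the first two belong to the affordance region.
targets = [1, 1, 0, 0]
good = combined_loss([0.9, 0.8], [0.9, 0.8, 0.1, 0.2], targets)
bad = combined_loss([0.9, 0.8], [0.2, 0.1, 0.8, 0.9], targets)
print(good < bad)
```

Because the text and mask terms are summed into one scalar, a single backward pass would update both the language-model side and the mask decoder, which is the sense of "end-to-end" assumed in this sketch.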

Abstract

3D affordance detection is a challenging problem with broad applications in robotic tasks. Existing methods typically formulate detection as a label-based semantic segmentation task. This paradigm relies on predefined labels and cannot comprehend complex natural language, which limits generalization in open-world scenes. To address these limitations, we reformulate the traditional affordance detection paradigm as the Instruction Reasoning Affordance Segmentation (IRAS) task. This task is designed to output an affordance mask region given a query reasoning text, which avoids fixed categories of input labels. We accordingly propose 3D-AffordanceLLM (3D-ADLLM), a framework designed for reasoning affordance detection in open 3D scenes. Specifically, 3D-ADLLM introduces large language models (LLMs) to 3D affordance perception with a custom-designed decoder for generating affordance masks, thus achieving open-world reasoning affordance detection. In addition, given the scarcity of 3D affordance datasets for training large models, we seek to extract knowledge from general segmentation data and transfer it to affordance detection. Thus, we propose a multi-stage training strategy that begins with a novel pre-training task, Referring Object Part Segmentation (ROPS). This stage is designed to equip the model with general recognition and segmentation capabilities at the object-part level. Subsequent fine-tuning on the IRAS task then gives 3D-ADLLM the reasoning ability needed for affordance detection. In summary, 3D-ADLLM leverages the rich world knowledge and human-object interaction reasoning ability of LLMs, achieving an improvement of approximately 8% in mIoU on open-vocabulary affordance detection tasks.

Paper Structure

This paper contains 26 sections, 13 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Comparison of the affordance detection paradigms based on our IRAS task and on traditional label-based segmentation. (a) The label-based paradigm can detect only a fixed set of affordance regions through predefined labels and a segmentation head; (b) the IRAS-based paradigm links semantically complex instructions to object affordances, enabling open-world reasoning affordance detection.
  • Figure 2: The pipeline of 3D-ADLLM. Given the input point cloud and a query reasoning instruction, the point cloud multimodal model is trained with LoRA to predict the special token <AFF>. The special token and the dense point features from $f_\mathrm{PB}$ are then fed into our affordance decoder to generate the final affordance mask.
  • Figure 3: Multi-stage training strategy. Illustration of transferring general segmentation knowledge to affordance detection. (a) depicts the process of extracting general segmentation knowledge, while (b) illustrates the framework for transferring this knowledge to affordance detection.
  • Figure 4: Visualization results of 3D-ADLLM compared with other methods.
  • Figure 5: Analysis of the IRAS task.
  • ...and 2 more figures
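
The final step described in the Figure 2 caption, where the <AFF> token and dense point features are combined into a mask, can be pictured with a deliberately simplified readout. This is not the paper's affordance decoder, whose internals are not given in this summary; it is a dot-product stand-in, assuming the <AFF> embedding has already been projected to the same dimension as the per-point features. The function name `predict_affordance_mask` and the toy vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_affordance_mask(aff_query, point_features, threshold=0.5):
    """Score each point by its dot product with the <AFF> query vector.

    aff_query: list of floats, the (assumed already projected) <AFF> embedding.
    point_features: one feature vector per point, same length as aff_query.
    Returns per-point probabilities and a thresholded 0/1 mask.
    """
    probs = []
    for feature in point_features:
        logit = sum(q * f for q, f in zip(aff_query, feature))
        probs.append(sigmoid(logit))
    mask = [1 if p > threshold else 0 for p in probs]
    return probs, mask


# Toy example with three points and 2-dimensional features.
aff_query = [1.0, -1.0]
point_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
probs, mask = predict_affordance_mask(aff_query, point_features)
print(mask)
```

Points whose features align with the query direction receive high scores and enter the mask; a learned decoder would replace the plain dot product with trained layers, but the input and output shapes would stay the same.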