SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model

Chunlin Yu, Hanqing Wang, Ye Shi, Haoyang Luo, Sibei Yang, Jingyi Yu, Jingya Wang

TL;DR

The paper introduces the Sequential 3D Affordance Reasoning task and a large-scale instruction-point cloud benchmark to address long-horizon, multi-object grounding in 3D scenes. It proposes SeqAfford, a 3D multimodal large language model that combines a 3D vision encoder, a ShapeLLM-based backbone, and a multi-granular language-point integration (MGLP) module to reason about sequential affordances and ground them as a sequence of segmentation masks. The approach outperforms the compared methods on both single and sequential affordance tasks and shows open-world generalization, and ablations support the importance of the MGLP module and of the backbone choice. The work is a step toward embodied agents that can interpret complex human instructions and carry out multi-step 3D manipulations grounded in language and perception.

Abstract

3D affordance segmentation aims to link human instructions to touchable regions of 3D objects for embodied manipulations. Existing efforts typically adhere to single-object, single-affordance paradigms, in which each affordance type or explicit instruction corresponds strictly to one affordance region, and they therefore cannot handle long-horizon tasks. Such a paradigm cannot actively reason about complex user intentions that often imply sequential affordances. In this paper, we introduce the Sequential 3D Affordance Reasoning task, which extends the traditional paradigm by reasoning from complex user intentions and then decomposing them into a series of segmentation maps. Toward this, we construct the first instruction-based affordance segmentation benchmark that includes reasoning over both single and sequential affordances, comprising 180K instruction-point cloud pairs. Based on the benchmark, we propose our model, SeqAfford, which equips a 3D multi-modal large language model with additional affordance segmentation abilities, combining reasoning with world knowledge and fine-grained affordance grounding in a cohesive framework. We further introduce a multi-granular language-point integration module to enable 3D dense prediction. Extensive experimental evaluations show that our model outperforms well-established methods and exhibits open-world generalization with sequential reasoning abilities.

Paper Structure

This paper contains 30 sections, 10 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Sequential 3D affordance reasoning task with different types of interactions. We introduce SeqAfford, a Multi-Modal Language Model (MLLM) capable of serialized affordance inference implied in human instructions: 1) Single Affordance Reasoning; 2) Sequential Affordance Reasoning; 3) Sequential Affordance Reasoning with Multiple Objects.
  • Figure 2: Preparing the instructions. To better utilize the world knowledge of GPT-4, we prompt GPT-4o to generate diverse instructions based on 4 types of system prompts containing different modalities as input. Instructions are generated from input prompts with the following modalities: a) purely textual affordance type and object name; b) the mesh-rendered image of the object; c) the mesh-rendered image and HOI images that reveal affordances of the object; d) the mesh-rendered image and a textual description of the scenario.
  • Figure 3: Main Pipeline. Given the point clouds of the target objects and a complex human instruction, SeqAfford first reasons from this instruction and decomposes it into several hidden <SEG> tokens extracted from the last-layer embeddings, each representing an intermediate affordance segmentation result. Then, for each <SEG>, the point features extracted by the 3D vision encoder dynamically interact with the <SEG> token before being sent to the decoder for mask generation. The interaction is achieved through multi-granular language-point integration, synergizing both reasoning and affordance segmentation. We use LoRA for efficient fine-tuning.
  • Figure 4: Multi-Granular Language-Point Integration Module. We propose an interaction module between the <SEG> tokens from the LLM and the point features from the 3D vision encoder, to synergize both reasoning and segmentation in a cohesive framework. This module consists of the multi-granular feature propagation process and the point-language integration stage.
  • Figure 5: Qualitative results of our model. SeqAfford understands human instruction and accurately segments the target affordance.
  • ...and 12 more figures
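
The Figure 3 caption describes one mask being produced per <SEG> token. A minimal sketch of that control flow follows, in pure Python with toy numbers. Everything in it is an assumption made for illustration: the function names, the 2-dimensional embeddings, and especially the dot-product-and-threshold decoder, which stands in for the paper's multi-granular language-point integration module and mask decoder and does not reproduce either.

```python
# Illustrative sketch only. All names, shapes, and the dot-product "decoder"
# are hypothetical stand-ins, not the SeqAfford implementation.
from typing import List

SEG = "<SEG>"


def seg_positions(tokens: List[str]) -> List[int]:
    """Return the index of every <SEG> token in the generated sequence, in order."""
    return [i for i, tok in enumerate(tokens) if tok == SEG]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decode_mask(seg_embedding: List[float],
                point_features: List[List[float]],
                threshold: float = 0.0) -> List[bool]:
    """Stand-in decoder: a point belongs to the mask when its feature
    has a dot product with the <SEG> embedding above the threshold."""
    return [dot(seg_embedding, feat) > threshold for feat in point_features]


def sequential_masks(tokens: List[str],
                     hidden_states: List[List[float]],
                     point_features: List[List[float]]) -> List[List[bool]]:
    """Produce one per-point mask for each <SEG> token, in generation order."""
    return [decode_mask(hidden_states[i], point_features)
            for i in seg_positions(tokens)]


# Toy example: two <SEG> tokens, three points with 2-D features.
tokens = ["open", SEG, "then", "grasp", SEG]
hidden_states = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
point_features = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]

masks = sequential_masks(tokens, hidden_states, point_features)
for step, mask in enumerate(masks, start=1):
    print(f"step {step}: {mask}")
```

The point of the sketch is the ordering: the sequence of masks follows the order in which the <SEG> tokens appear in the model output, so a multi-step instruction maps to an ordered list of affordance regions.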