PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model

Amrin Kareem, Jean Lahoud, Hisham Cholakkal

TL;DR

This work addresses the inability of current 3D perception systems to handle implicit user intent by proposing reasoning-based 3D part segmentation. It introduces the RPSeg3D dataset (2624 objects, 60k+ instructions) and the PARIS3D architecture, which renders multi-view images, passes them through a multimodal reasoning backbone, and lifts the per-view masks into a coherent 3D segmentation accompanied by explanations. PARIS3D performs competitively with explicit-query baselines and can additionally identify part concepts, reason about them, and draw on world knowledge in its explanations. The dataset and framework advance interactive, language-driven 3D perception, with practical implications for robotics and intelligent visualization; instance-level segmentation is left to future work.

Abstract

Recent advancements in 3D perception systems have significantly improved their ability to perform visual recognition tasks such as segmentation. However, these systems still rely heavily on explicit human instruction to identify target objects or categories, lacking the capability to actively reason and comprehend implicit user intentions. We introduce a novel segmentation task known as reasoning part segmentation for 3D objects, aiming to output a segmentation mask based on complex and implicit textual queries about specific parts of a 3D object. To facilitate evaluation and benchmarking, we present a large 3D dataset comprising over 60k instructions paired with corresponding ground-truth part segmentation annotations specifically curated for reasoning-based 3D part segmentation. We propose a model that is capable of segmenting parts of 3D objects based on implicit textual queries and generating natural language explanations corresponding to 3D object segmentation requests. Experiments show that our method achieves performance competitive with models that use explicit queries, with the additional abilities to identify part concepts, reason about them, and complement them with world knowledge. Our source code, dataset, and trained models are available at https://github.com/AmrinKareem/PARIS3D.

Paper Structure (17 sections, 7 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Capabilities of PARIS3D. Parts of 3D objects are segmented based on reasoning, shape, location, material, colour, and concept instructions. Additionally, for the segmentations, PARIS3D can explain why it chose that region, or describe 3D objects with respect to their parts. The original point clouds are on the left. The segmented parts are shown to the right, highlighted in golden colour.
  • Figure 2: Examples of the annotated object-instruction pairs for training with two types of queries. On the left is one view of the rendered image from the original point cloud. On the right is the corresponding ground truth segmentation mask, shown in green.
  • Figure 3: Preparing the instructions of RPSeg3D. Simple templates are provided to GPT-3.5, which populates them with part information-related segmentation instructions. In parallel, colour, shape, location, and dimension-related data is extracted from 3D point clouds. Enriching the instructions with this information and manually checking them for inaccuracies, we obtain the RPSeg3D dataset for part segmentation.
  • Figure 4: Overview of the proposed reasoning-based 3D part segmentation approach named PARIS3D. It comprises four sequential steps: (i) The 3D point cloud is rendered into K multi-view images $x_{img}$ using a renderer. (ii) These images are passed through a frozen vision backbone ($F_{enc}$) and the multimodal large language model ($F$) of the reasoning module. $F$ also accepts the text query $x_{txt}$ and produces text outputs corresponding to each view. (iii) The decoder decodes the final-layer embedding, which contains the extra token, thus producing K segmentation masks. (iv) Finally, a mask-to-3D segmentation algorithm lifts the projections back into 3D, and a view-guided scoring module is used to obtain the final text response.
  • Figure 5: Qualitative results of PARIS3D's performance. We showcase examples from three tasks: reasoning 3D object part segmentation, object description, and reasoning question-answering, demonstrating its capabilities in offering in-depth reasoning, 3D understanding, part segmentation, and conversational abilities.
  • ...and 1 more figure
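
Step (iv) of the Figure 4 caption, lifting K per-view masks back onto the point cloud, can be illustrated with a simple voting scheme. The sketch below is not the paper's mask-to-3D algorithm, whose details are not given in this summary. It assumes a hypothetical renderer that reports, for every view, the pixel each point projects to (`pixel_index`) and whether the point is visible in that view (`visible`). The function name `lift_masks_to_points` and the threshold `min_vote_ratio` are likewise illustrative.

```python
import numpy as np


def lift_masks_to_points(pixel_index, visible, masks, min_vote_ratio=0.5):
    """Label each 3D point by voting over the views in which it is visible.

    pixel_index: (K, N, 2) int array, (row, col) of every point in every view
                 (assumed to come from the renderer).
    visible:     (K, N) bool array, True where the point is not occluded.
    masks:       (K, H, W) bool array, one predicted 2D mask per view.
    Returns an (N,) bool array: True where the point belongs to the part.
    """
    num_views, num_points, _ = pixel_index.shape
    rows = pixel_index[..., 0]
    cols = pixel_index[..., 1]
    view_ids = np.arange(num_views)[:, None]  # broadcasts against (K, N)

    # A view votes for a point if the point is visible and lands inside the mask.
    hits = masks[view_ids, rows, cols] & visible
    votes = hits.sum(axis=0)
    seen = visible.sum(axis=0)

    # Fraction of observing views that voted for the point; 0 if never seen.
    ratio = np.divide(
        votes, seen, out=np.zeros(num_points, dtype=float), where=seen > 0
    )
    return ratio > min_vote_ratio
```

Normalising by the number of views in which a point is visible, rather than by K, keeps points that are occluded in most views from being penalised. Points that are never visible are left unlabelled.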