Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model

Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, Ming-Hsuan Yang

TL;DR

Reason3D introduces an LLM-guided framework for dense 3D segmentation that pairs text-based reasoning with precise 3D masks. The approach aligns point-cloud features with a frozen decoder-only LLM via an Interactor and employs a hierarchical mask decoder that first generates a coarse region prior ([LOC]) and then a refined object mask ([SEG]). This enables four tasks: 3D reasoning segmentation, hierarchical searching, express referring, and QA. Experiments on ScanNet and Matterport3D show state-of-the-art or competitive performance on all four tasks, supporting the effectiveness of coarse-to-fine decoding and token-guided priors. The work also provides a new dataset and prompts for 3D reasoning tasks, and it discusses practical implications for interactive 3D understanding as well as potential limitations related to scale, false premises, and bias.

Abstract

Recent advancements in multimodal large language models (LLMs) have demonstrated significant potential across various domains, particularly in concept reasoning. However, their applications in understanding 3D environments remain limited, primarily offering textual or numerical outputs without generating dense, informative segmentation masks. This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D processes point cloud data and text prompts to produce textual responses and segmentation masks, enabling advanced tasks such as 3D reasoning segmentation, hierarchical searching, express referring, and question answering with detailed mask outputs. We propose a hierarchical mask decoder that employs a coarse-to-fine approach to segment objects within expansive scenes. It begins with a coarse location estimation, followed by object mask estimation, using two unique tokens predicted by LLMs based on the textual query. Experimental results on large-scale ScanNet and Matterport3D datasets validate the effectiveness of our Reason3D across various tasks.
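
To make the coarse-to-fine idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each superpoint already has a feature vector and that the LLM has produced one embedding for the [LOC] token and one for the [SEG] token. The function name `coarse_to_fine_mask`, the dot-product scoring, the multiplicative fusion of the location prior with the mask score, and the 0.5 threshold are all illustrative assumptions that do not come from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_mask(superpoint_feats, loc_embedding, seg_embedding,
                        threshold=0.5):
    """Hypothetical two-stage decoding over superpoint features.

    Stage 1 scores every superpoint against the [LOC] embedding to get a
    coarse location prior. Stage 2 scores every superpoint against the
    [SEG] embedding and gates that score with the prior (an assumed
    fusion rule) before thresholding into a binary mask.
    """
    # Stage 1: coarse location prior in [0, 1] per superpoint.
    loc_prior = [sigmoid(dot(f, loc_embedding)) for f in superpoint_feats]
    # Stage 2: object score per superpoint, gated by the coarse prior.
    seg_prob = [sigmoid(dot(f, seg_embedding)) for f in superpoint_feats]
    fused = [p * s for p, s in zip(loc_prior, seg_prob)]
    mask = [score > threshold for score in fused]
    return loc_prior, mask


# Toy example: three superpoints with 2-D features.
feats = [[2.0, 2.0], [2.0, -2.0], [-2.0, 0.0]]
prior, mask = coarse_to_fine_mask(feats, [1.0, 0.0], [0.0, 1.0])
print(mask)  # [True, False, False]
```

In this toy example the second superpoint lies inside the coarse region but scores low against the [SEG] embedding, while the third falls outside the coarse region, so only the first survives both stages.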

Paper Structure (30 sections, 7 equations, 8 figures, 9 tables)

This paper contains 30 sections, 7 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview. We propose Reason3D, a novel LLM-based 3D point cloud searching and reasoning framework that can output dense segmentation masks based on textual descriptions. Reason3D can handle four tasks: 1) 3D Reasoning, 2) 3D Hierarchical Searching, 3) 3D Express Referring, and 4) 3D QA, each with corresponding dense segmentation masks.
  • Figure 2: Annotated Sample Examples. (a) shows a sample from the Matterport3D dataset whose answer is "pool table". (b) presents a sample from the ScanNetV2 dataset whose answer is "fireplace".
  • Figure 3: Overview of our Reason3D framework. First, a point encoder extracts point features from the input scene, and a superpoint pooling layer aggregates them to reduce complexity. An interactor merges these superpoint features with a learnable query; the result is fed into a frozen LLM together with the instruction to generate an output containing two special tokens, [LOC] and [SEG]. A hierarchical mask decoder then uses the [LOC] embedding to estimate a coarse location that likely covers the target object. Finally, this estimated location prior is combined with the [SEG] embedding to predict the final segmentation mask.
  • Figure 4: Visualization Results for 3D Reasoning Segmentation Tasks. Each sub-figure presents a textual query alongside the input point cloud. The purple regions highlight the predicted segmentation masks generated by our model.
  • Figure 5: Visualization Results for 3D Reasoning Segmentation Tasks. The purple regions highlight the predicted segmentation masks generated by our model. Best viewed zoomed in.
  • ...and 3 more figures
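
The superpoint pooling step mentioned in the Figure 3 caption can be illustrated with a short sketch. The caption says only that pooling reduces complexity, so averaging the features of all points that share a superpoint id is an assumption made here for illustration. The function name `superpoint_pool` and the input layout (one feature list and one integer id per point) are also hypothetical.

```python
def superpoint_pool(point_feats, superpoint_ids):
    """Average per-point features within each superpoint (assumed rule).

    point_feats: list of equal-length feature vectors, one per point.
    superpoint_ids: list of integer superpoint ids, one per point.
    Returns a dict mapping superpoint id -> mean feature vector.
    """
    if len(point_feats) != len(superpoint_ids):
        raise ValueError("need exactly one superpoint id per point")

    sums = {}
    counts = {}
    for feat, sp in zip(point_feats, superpoint_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
            counts[sp] = 0
        # Accumulate the feature vector component-wise.
        sums[sp] = [acc + v for acc, v in zip(sums[sp], feat)]
        counts[sp] += 1

    # Divide each accumulated vector by the number of points it covers.
    return {sp: [v / counts[sp] for v in sums[sp]] for sp in sorted(sums)}


# Three points grouped into two superpoints.
pooled = superpoint_pool([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]], [0, 0, 1])
print(pooled)  # {0: [2.0, 3.0], 1: [10.0, 20.0]}
```

Here three per-point vectors collapse into two superpoint vectors, which is the sense in which pooling shrinks the input before it reaches the interactor.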