Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, Lingyun Sun

TL;DR

Reasoning3D tackles zero-shot 3D reasoning segmentation by rendering 3D meshes from multiple views and processing open-vocabulary prompts with vision-language foundation models and large language models. The method produces 2D segmentation masks with explanations per view and fuses them into a coherent 3D segmentation via a face-ID mapping and refinement steps such as Gaussian geodesic reweighting and visibility smoothing. It is training-free and generalizes to articulated objects and real-world scans, offering a practical baseline for future part-level 3D understanding. The authors release code, weights, a deployment guide, and an evaluation protocol to accelerate research and deployment in robotics, AR/VR, autonomous driving, and medical fields.

Abstract

In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation for searching and localizing parts of objects, a new paradigm for 3D segmentation that transcends the limitations of previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, that can understand and execute complex commands for (fine-grained) segmentation of specific parts of 3D meshes, with contextual awareness and reasoned answers for interactive segmentation. Specifically, Reasoning3D leverages an off-the-shelf pre-trained 2D segmentation network, powered by Large Language Models (LLMs), to interpret user input queries in a zero-shot manner. Previous research has shown that extensive pre-training endows foundation models with prior world knowledge, enabling them to comprehend complex commands, a capability we can harness to "segment anything" in 3D with limited 3D datasets (source efficient). Experiments show that our approach generalizes and can effectively localize and highlight parts of 3D objects (as 3D meshes) based on implicit textual queries, including articulated 3D objects and real-world scanned data. Our method can also generate natural language explanations corresponding to these 3D models and their decomposition. Moreover, our training-free approach allows rapid deployment and serves as a viable universal baseline for future research on part-level 3D (semantic) object understanding in various fields, including robotics, object manipulation, part assembly, autonomous driving applications, augmented reality and virtual reality (AR/VR), and medical applications. The code, the model weights, the deployment guide, and the evaluation protocol are available at: http://tianrun-chen.github.io/Reason3D/

Paper Structure (17 sections, 7 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: In this work, we propose a new task: reasoning 3D segmentation. We also propose a method that can segment 3D object parts with explanations based on various criteria such as reasoning, shape, location, function, and conceptual instructions.
  • Figure 2: The overview of Reasoning3D. First, a 3D model represented by 3D meshes is fed into a renderer to obtain multi-view images. Then, each image goes through a vision backbone and a multi-modal LLM along with user input queries. The decoder decodes the final layer embedding which contains the extra token, thus producing K segmentation masks. We also extract the bounding boxes in this stage. Finally, a specially designed mask-to-3D segmentation algorithm elevates the projections back into the 3D space.
  • Figure 3: Qualitative results and comparison between our method and the baseline method on the FAUST benchmark. The segmented regions are shown in red.
  • Figure 4: A natural language command can make the model segment the desired regions. The segmented regions are shown in red.
  • Figure 5: We offer a user-friendly interface designed for performance assessment, facilitating the easy upload of 3D models and prompts by users. It enables swift acquisition of 3D segmentation outcomes. This tailored software is available as open-source.
  • ...and 1 more figure
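
The last step in the Figure 2 caption, lifting per-view 2D masks back onto the mesh, can be illustrated with a minimal sketch. It assumes that the renderer also produces, for every view, a face-ID map giving the index of the mesh face visible at each pixel, and it fuses the views by plain per-face voting. The function name, the `BACKGROUND` marker, and the `threshold` parameter are hypothetical and chosen for illustration; refinement steps such as Gaussian geodesic reweighting or visibility smoothing are not included, so this should not be read as the authors' implementation.

```python
BACKGROUND = -1  # assumed marker for pixels that hit no mesh face


def fuse_masks_to_faces(face_id_maps, masks, num_faces, threshold=0.5):
    """Vote per-view 2D masks onto mesh faces (illustrative sketch).

    face_id_maps: one 2D grid per view; each pixel holds the index of the
        face visible at that pixel, or BACKGROUND.
    masks: one 2D grid per view with 0/1 values (the 2D segmentation).
    Returns a list with one boolean per face: True if the face is selected.
    """
    hits = [0] * num_faces  # visible pixels of the face that fall inside a mask
    seen = [0] * num_faces  # visible pixels of the face across all views
    for ids, mask in zip(face_id_maps, masks):
        for id_row, mask_row in zip(ids, mask):
            for face, inside in zip(id_row, mask_row):
                if face == BACKGROUND:
                    continue
                seen[face] += 1
                if inside:
                    hits[face] += 1
    # A face is selected when enough of its visible pixels were masked;
    # faces that were never visible stay unselected.
    return [s > 0 and h / s >= threshold for h, s in zip(hits, seen)]


# Two tiny 2x3 "views" of a mesh with four faces (face 3 is never visible).
view_a_ids = [[0, 0, 1], [-1, 1, 2]]
view_a_mask = [[1, 1, 0], [0, 0, 0]]
view_b_ids = [[0, 2, 2], [2, -1, 1]]
view_b_mask = [[1, 0, 0], [0, 0, 1]]

labels = fuse_masks_to_faces(
    [view_a_ids, view_b_ids], [view_a_mask, view_b_mask], num_faces=4
)
print(labels)  # [True, False, False, False]
```

Counting pixels rather than views weights each view by how much of the face it actually shows, so a face seen only at a grazing angle in one view contributes less to the final decision.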