Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Xiangtai Li, Yabiao Wang, Yong Liu

TL;DR

OV-SAM3D is a training-free, universal framework for open-vocabulary 3D scene understanding. It requires no prior knowledge of the scene and surpasses existing open-vocabulary methods in unknown open-world environments.

Abstract

Open-vocabulary 3D scene understanding remains a significant challenge. Recent works have sought to transfer knowledge embedded in vision-language models from the 2D to the 3D domain. However, these approaches often require prior knowledge from specific 3D scene datasets, limiting their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without training. In this paper, we introduce OV-SAM3D, a training-free, universal framework for open-vocabulary 3D scene understanding. The framework is designed to perform understanding tasks on any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules. First, we generate superpoints as the initial 3D prompts and refine these prompts using segment masks derived from SAM. Second, we integrate a specially designed overlapping score table with open tags from the Recognize Anything Model (RAM) to produce final 3D instances with open-world labels. Empirical evaluations on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.

Paper Structure (32 sections, 2 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 32 sections, 2 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Paradigm comparison with contemporary methods. (a): SAM3D [yang2023sam3d] leverages SAM to achieve zero-shot 3D segmentation but generates only class-agnostic segmentation results. (b): OpenMask3D [takmaz2023openmask3d] implements open-vocabulary scene understanding but requires training a 3D proposal network under supervision. (c): Our OV-SAM3D model effectively transfers the extensive knowledge embedded in SAM to the 3D domain and enhances instance segmentation and recognition without requiring additional training. This capability enables open-vocabulary 3D scene understanding across various environments.
  • Figure 2: Overview of our OV-SAM3D, which consists of two sub-modules. 1) SAM-centric Coarse Mask Generation first adapts a graph-based over-segmentation method to generate superpoints and selects some of them as initial 3D prompts to guide SAM. We then revise the initial 3D prompts using the SAM masks and create an overlapping score table. 2) Open Tags Guided Coarse Mask Merging combines the overlapping score table with the reasonable open tags recognized by RAM to produce labeled 3D instances, which completes the open-vocabulary 3D scene understanding task.
  • Figure 3: Visualization of overlapping scores. We calculate the overlapping scores between the queried coarse mask and all other coarse masks. If a score is higher than a certain threshold, the two coarse masks are taken to belong to the same instance. Here, we visualize some overlapping scores from high to low.
  • Figure 4: What issues can the updatable merging strategy solve? When a large instance is split across different views, the overlap region between its masks may be minimal, so they cannot be merged directly, as shown at the top. With our updatable merging strategy, the progressive mask gradually enlarges until it meets the merging conditions.
  • Figure 5: Generation of open instance tags. We show how to generate the open instance tags through RAM [zhang2023recognize] and ChatGPT. The red tags from RAM are not instance labels and need to be filtered out by ChatGPT.
  • ...and 3 more figures
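
The captions for Figures 3 and 4 describe threshold-based merging of coarse masks and an updatable (progressive) merging strategy. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation. Coarse masks are represented as sets of 3D point indices. The score definition (intersection size divided by the size of the smaller mask), the threshold of 0.5, and the names `overlap_score` and `merge_masks` are all hypothetical choices made for illustration, since this excerpt does not give the paper's exact definitions.

```python
def overlap_score(a, b):
    """Assumed score: share of the smaller mask that lies in the intersection."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def merge_masks(masks, threshold=0.5):
    """Greedy merging in which the growing mask is re-compared after each merge.

    Illustrative only: the threshold and the greedy order are assumptions.
    """
    remaining = [set(m) for m in masks]
    instances = []
    while remaining:
        current = remaining.pop(0)  # start a new progressive mask
        grew = True
        while grew:  # rescan whenever the progressive mask has enlarged
            grew = False
            for other in list(remaining):
                if overlap_score(current, other) >= threshold:
                    current |= other
                    remaining.remove(other)
                    grew = True
        instances.append(current)
    return instances


# Toy example: masks A, B, C cover one large instance seen in different views.
A = set(range(0, 10))                   # points 0..9
B = set(range(5, 13))                   # points 5..12
C = {3, 4, 11, 12, 20, 21, 22, 23}      # overlaps A and B only slightly
D = {100, 101, 102}                     # unrelated instance

instances = merge_masks([A, C, B, D], threshold=0.5)
print(len(instances))  # 2
```

In this toy case, C scores 0.25 against A and 0.25 against B, so it cannot be merged with either directly. Once A and B are merged, the enlarged mask scores 0.5 against C and the merge goes through, which is the behavior the Figure 4 caption attributes to the updatable strategy.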