Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

Tri Ton, Ji Woo Hong, SooHwan Eom, Jun Yeop Shim, Junyeong Kim, Chang D. Yoo

TL;DR

The paper tackles open-vocabulary 3D instance segmentation by bridging 3D point-cloud proposals with 2D open-vocabulary proposals through a Zero-Shot Dual-Path Integration Framework. It introduces three components: a 3D pathway for class-agnostic 3D masks, a 2D pathway leveraging open-vocabulary 2D segmentation (Grounded-SAM) with CLIP features, and a two-stage Conditional Integration that uses IoU-based matching and adaptive merging to produce final, text-query-friendly 3D instances. The approach is model-agnostic and zero-shot, demonstrating improvements on ScanNet200 and qualitative results on ARKitScenes, particularly in recognizing unseen or tail-class objects. This cross-modal fusion enhances segmentation accuracy and generalization in real-world indoor environments by exploiting complementary strengths of 3D geometry and 2D vision-language understanding.

Abstract

Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods by enabling the identification of both previously seen and unseen objects in real-world scenarios. It leverages a dual-modality approach, utilizing both 3D point clouds and 2D multi-view images to generate class-agnostic object mask proposals. Previous efforts predominantly focused on enhancing 3D mask proposal models; consequently, the information available from associating 2D proposals with 3D was not fully exploited. This bias towards 3D data, while effective for familiar indoor objects, limits the system's adaptability to new and varied object types, where 2D models offer greater utility. Addressing this gap, we introduce the Zero-Shot Dual-Path Integration Framework, which equally values the contributions of both 3D and 2D modalities. Our framework comprises three components: a 3D pathway, a 2D pathway, and Dual-Path Integration. The 3D pathway generates spatially accurate class-agnostic mask proposals of common indoor objects from 3D point cloud data using a pre-trained 3D model, while the 2D pathway utilizes a pre-trained open-vocabulary instance segmentation model to identify a diverse array of object proposals from multi-view RGB-D images. In Dual-Path Integration, our Conditional Integration process, which operates in two stages, adaptively filters and merges the proposals from both pathways. This process harmonizes the output proposals to enhance segmentation capabilities. Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data, as evidenced by comprehensive evaluations on ScanNet200 and qualitative results on ARKitScenes.
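
To make the IoU-based matching and merging idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: it assumes every proposal from either pathway is a boolean per-point mask over the same point cloud, and the names `mask_iou`, `integrate`, and the threshold `iou_thresh` are hypothetical. The merge rule shown (keep all 3D proposals, add a 2D-pathway proposal only when no 3D proposal overlaps it enough) is one plausible reading of "filter and merge", chosen for illustration.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection-over-Union of two boolean per-point masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def integrate(masks_3d, masks_2d, iou_thresh=0.5):
    """Illustrative merge rule (an assumption, not the paper's exact rule).

    Every 3D-pathway proposal is kept. A 2D-pathway proposal is added only
    if its best IoU against all 3D proposals is below iou_thresh, i.e. it
    covers an object the 3D pathway did not propose.
    """
    merged = [m.copy() for m in masks_3d]
    for m2 in masks_2d:
        best = max((mask_iou(m2, m3) for m3 in masks_3d), default=0.0)
        if best < iou_thresh:
            merged.append(m2.copy())
    return merged
```

In this sketch a 2D-pathway mask that duplicates an existing 3D proposal is dropped, while one covering otherwise unproposed points survives, which mirrors the motivation of letting the 2D pathway contribute objects the 3D model misses.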

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Instance segmentation results from the 2D and 3D modalities. Our Zero-Shot Dual-Path Integration Framework integrates the complementary outputs of the two modalities.
  • Figure 2: Capability of pre-trained open-vocabulary 2D instance segmentation in detecting uncommon and unseen object classes that remain undetected by pre-trained 3D instance segmentation models.
  • Figure 3: Overview of our Zero-Shot Dual-Path Integration Framework. The 3D pathway takes a 3D point cloud $\mathbf{P}$ as input to generate class-agnostic 3D instance masks $\mathbf{M}_\text{i}^\text{3D}$ with a pre-trained 3D Mask Proposal Network, and the per-mask visual features $\mathbf{F}_\text{i}^\text{3D}$ are extracted with the CLIP visual encoder (Radford et al., 2021). The 2D pathway generates its own 3D instance masks $\mathbf{M}_\text{j}^\text{2D}$ from the RGB-D image input $\mathbf{I}$ using an Open-vocabulary 2D Mask Proposal Network and a 2D-to-3D Projection module, along with the per-mask visual features $\mathbf{F}_\text{j}^\text{2D}$ of each mask. The outputs of the two pathways are integrated through Conditional Integration, which uses Intersection-over-Union (IoU) for Dual-modality Proposal Matching and Adaptive Integration, yielding the final 3D instance masks $\mathbf{M}_\text{k}$ and their visual features $\mathbf{F}_\text{k}$ as outputs.
  • Figure 4: Qualitative results showcasing the proficiency of our framework in performing open-vocabulary 3D instance segmentation. The displayed results include objects from two distinct datasets: the upper two objects are from ScanNet200 scenes, while the lower two are from ARKitScenes, demonstrating our framework's adaptability and effectiveness across diverse environments.
  • Figure 5: Qualitative comparison between our Dual-Path Integration Framework and OpenMask3D. The black regions indicate no proposals. Our framework, benefiting from the integration of proposals endowed with high visual understanding capabilities from the 2D pathway, excels in identifying and segmenting uncommon and unseen objects.
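
The 2D-to-3D Projection module named in the Figure 3 caption is not specified on this page. A common way to lift a 2D mask into 3D from an RGB-D frame is pinhole back-projection, sketched below in Python with NumPy. This is an assumption about the general technique, not the paper's module: the function name `backproject_mask`, the intrinsics matrix `K`, and the camera-to-world pose `cam_to_world` are hypothetical inputs, and depth is assumed to be metric with 0 marking invalid pixels.

```python
import numpy as np


def backproject_mask(mask, depth, K, cam_to_world):
    """Lift the pixels selected by a 2D mask to world-space 3D points.

    mask:         (H, W) boolean instance mask
    depth:        (H, W) depth map, 0 where depth is invalid (assumed)
    K:            (3, 3) pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    cam_to_world: (4, 4) camera-to-world pose
    """
    # Row index v and column index u of masked pixels with valid depth.
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    # Apply the pose to homogeneous points and drop the last coordinate.
    return (pts_cam @ cam_to_world.T)[:, :3]
```

The returned points could then be associated with the nearest points of the scene point cloud to obtain a per-point mask comparable with the 3D-pathway proposals; that association step is likewise an assumption and is omitted here.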