Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Mohamed El Amine Boudjoghra; Angela Dai; Jean Lahoud; Hisham Cholakkal; Rao Muhammad Anwer; Salman Khan; Fahad Shahbaz Khan

TL;DR

Open-YOLO 3D tackles the slow inference of open-vocabulary 3D instance segmentation by replacing heavy 2D-to-3D feature lifting based on SAM and CLIP with a fast 2D open-vocabulary object detector, whose bounding boxes are used to label 3D proposals. The method builds Low-Granularity (LG) label maps from multi-view 2D boxes, accelerates the computation of 3D mask visibility (VAcc), and uses a Multi-View Prompt Distribution (MVPDist) to assign text prompts to class-agnostic 3D masks. The class-agnostic masks come from a 3D proposal network (Mask3D), and each predicted instance receives a confidence score that combines IoU across views with MVPDist. Open-YOLO 3D achieves state-of-the-art results on ScanNet200 and Replica with up to ~16x speedup over the best existing method. The results indicate that an efficient 2D detector can replace expensive 2D foundation models such as SAM and CLIP for open-vocabulary 3D instance segmentation, which makes deployment in robotics and AR applications more practical.

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that text prompts can be matched to 3D masks both more accurately and faster with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in the literature. On the ScanNet200 validation set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.

Paper Structure (15 sections, 5 equations, 20 figures, 5 tables)

Introduction
Related works
Preliminaries
Baseline Open-Vocabulary 3D Instance Segmentation
Method: Open-YOLO 3D
Overall Architecture
3D Object Proposal
Low Granularity (LG) Label-Maps
Accelerated Visibility Computation (VAcc)
Multi-View Prompt Distribution (MVPDist)
Instance Prediction Confidence Score
Experiments
Results analysis
Conclusion
Appendix

Figures (20)

Figure 1: Open-vocabulary 3D instance segmentation with our Open-YOLO 3D. The proposed Open-YOLO 3D is capable of segmenting objects in a zero-shot manner. Here, we show the output for a ScanNet200 (Rozenberszki et al., 2022) scene with various prompts, where our model yields improved performance compared to the recent Open3DIS (Nguyen et al., 2023). We show zoomed-in images of hidden predicted instances in the colored boxes. Additional results are in Figure 4 and the supplementary material.
Figure 2: Proposed open-world 3D instance segmentation pipeline. We use a 3D instance segmentation network (3D Network) for generating class-agnostic proposals. For open-vocabulary prediction, a 2D Open-Vocabulary Object Detector (2D OVOD) generates bounding boxes with class labels. These predictions are used to construct label maps for all input frames. Next, we assign the top-k label maps to each 3D proposal based on visibility. Finally, we generate a Multi-View Prompt Distribution from the 2D projections of the proposals to match a text prompt to every 3D proposal.
Figure 3: Multi-View Prompt Distribution (MVPDist). After creating the LG label maps for all frames, we select the top-k label maps based on the 2D projection of the 3D proposal. Using the (x, y) coordinates of the 2D projection, we choose the labels from the LG label maps to generate the MVPDist. This distribution predicts the ID of the text prompt with the highest probability.
Figure 4: Qualitative results on scene office0 in the Replica dataset. We show instances with a confidence score above 0.5 for both methods. Our method segments the object named in the text prompt much more precisely than the state-of-the-art method Open3DIS.
Figure 5: Additional details on how the IoU score $s_{IoU}$ is computed. We show that our method can provide a reliable mask score using the Intersection over Union (IoU) between the bounding boxes estimated from the 2D projection of the cropped 3D instance and the bounding boxes from a 2D object detector. We also demonstrate that it covers all three cases of different 3D mask proposals.
...and 15 more figures
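
The captions of Figures 3 and 5 describe two small computations that are easy to illustrate: sampling LG label maps at the projected pixels of a 3D proposal to obtain the MVPDist, and scoring a proposal by the IoU between its projected bounding box and a detector box. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. All function names (`build_label_map`, `mvp_dist`, `projection_box`, `box_iou`), the background value `-1`, the rule that smaller boxes overwrite larger ones, and the toy 8x8 frame are assumptions made for this example.

```python
import numpy as np

def build_label_map(height, width, boxes, labels, background=-1):
    """Paint 2D detector boxes into a low-granularity label map.

    boxes: list of (x_min, y_min, x_max, y_max) in integer pixels.
    Assumption: larger boxes are painted first, so smaller boxes overwrite them.
    """
    label_map = np.full((height, width), background, dtype=np.int64)
    areas = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    for i in np.argsort(areas)[::-1]:
        x0, y0, x1, y1 = boxes[i]
        label_map[y0:y1, x0:x1] = labels[i]
    return label_map

def mvp_dist(label_maps, projections, num_prompts, background=-1):
    """Normalized histogram of prompt IDs sampled at a proposal's projected pixels.

    label_maps: list of (H, W) int arrays, one per selected (top-k) view.
    projections: list of (N_i, 2) int arrays holding (x, y) pixel coordinates.
    """
    counts = np.zeros(num_prompts, dtype=np.float64)
    for label_map, xy in zip(label_maps, projections):
        sampled = label_map[xy[:, 1], xy[:, 0]]       # row index is y, column is x
        sampled = sampled[sampled != background]      # ignore unlabeled pixels
        counts += np.bincount(sampled, minlength=num_prompts)
    total = counts.sum()
    return counts / total if total > 0 else counts

def projection_box(xy):
    """Tight box (x_min, y_min, x_max, y_max) around projected pixels."""
    return (xy[:, 0].min(), xy[:, 1].min(), xy[:, 0].max() + 1, xy[:, 1].max() + 1)

def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Toy single-view example: prompt 0 covers the whole frame, prompt 1 a smaller box.
boxes = [(0, 0, 8, 8), (2, 2, 6, 6)]
label_map = build_label_map(8, 8, boxes, labels=[0, 1])
pixels = np.array([[3, 3], [4, 4], [5, 5], [1, 1]])  # projected (x, y) points
dist = mvp_dist([label_map], [pixels], num_prompts=2)
best_prompt = int(dist.argmax())
score = box_iou(projection_box(pixels), boxes[1])
```

In this toy case three of the four projected pixels fall inside the smaller box, so the distribution favors prompt 1. How the paper selects the top-k views and how it combines the IoU term with MVPDist into the final confidence score is not reproduced here.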
