OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby

TL;DR

OpenIns3D tackles 3D open-vocabulary scene understanding using a 3D-input-only pipeline. Its Mask-Snap-Lookup framework first generates class-agnostic 3D masks, then renders synthetic scene-level 2D views, and finally assigns semantic labels by searching through a Class Lookup Table with Mask2Pixel mappings, enabling accurate cross-view classification without aligned 2D imagery. The approach achieves state-of-the-art results on multiple 3D open-vocabulary tasks across indoor and outdoor datasets and supports flexible integration with various 2D detectors and LLM-powered models for complex queries. This work significantly lowers deployment barriers by removing the need for 2D-3D alignment while maintaining strong performance and adaptability in evolving 2D open-world vision systems.

Abstract

In this work, we introduce OpenIns3D, a new 3D-input-only framework for 3D open-vocabulary scene understanding. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds, the "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to detect objects of interest, and the "Lookup" module searches through the outcomes of "Snap" to assign category names to the proposed masks. This approach, though simple, achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks, including recognition, object detection, and instance segmentation, on both indoor and outdoor datasets. Moreover, OpenIns3D allows switching between different 2D detectors without retraining. When integrated with powerful 2D open-world models, it achieves excellent results in scene understanding tasks. Furthermore, when combined with LLM-powered 2D models, OpenIns3D can comprehend and process highly complex text queries that demand intricate reasoning and real-world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/
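
The "Lookup" idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `lookup_labels`, the overlap threshold `min_overlap`, and the simple overlap-weighted voting rule are all assumptions made for illustration. It only shows how per-view 2D detections and a per-view record of where each 3D mask lands in the image could be combined to name each mask.

```python
from collections import defaultdict

def lookup_labels(mask2pixel, detections, min_overlap=0.5):
    """Assign a class name to each 3D mask by voting across rendered views.

    mask2pixel: {view_id: {mask_id: set of (row, col) pixels}}
    detections: {view_id: [(class_name, (r0, c0, r1, c1)), ...]}, inclusive box corners
    Hypothetical voting rule: a detection votes for a mask when at least
    `min_overlap` of the mask's pixels in that view fall inside the box.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for view_id, masks in mask2pixel.items():
        for mask_id, pixels in masks.items():
            if not pixels:
                continue  # mask is not visible in this view
            for class_name, (r0, c0, r1, c1) in detections.get(view_id, []):
                inside = sum(1 for (r, c) in pixels
                             if r0 <= r <= r1 and c0 <= c <= c1)
                frac = inside / len(pixels)
                if frac >= min_overlap:
                    votes[mask_id][class_name] += frac
    # Pick the class with the largest accumulated vote for each mask.
    return {m: max(v, key=v.get) for m, v in votes.items()}

# Toy example: two views, two mask proposals.
mask2pixel = {
    0: {"m0": {(1, 1), (1, 2), (2, 1), (2, 2)}, "m1": {(5, 5), (5, 6)}},
    1: {"m0": {(3, 3), (3, 4)}},
}
detections = {
    0: [("chair", (0, 0, 3, 3)), ("table", (4, 4, 7, 7))],
    1: [("chair", (2, 2, 5, 5))],
}
labels = lookup_labels(mask2pixel, detections)
```

Because the 2D detector only appears as a list of labelled boxes, swapping it for another detector changes the input data and nothing else, which matches the abstract's point about switching detectors without retraining.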

Paper Structure

This paper contains 20 sections, 1 equation, 19 figures, and 17 tables.

Figures (19)

  • Figure 1: Complex Queries 3D Instance Segmentation with OpenIns3D.
  • Figure 2: High-level Illustrations of OpenIns3D and Quantitative Results. (a) OpenIns3D follows the “Mask-Snap-Lookup” steps for open-vocabulary scene understanding. (b) OpenIns3D achieves SOTA results on both indoor and outdoor datasets. OV-Rec: open-vocabulary object recognition. OVOD: open-vocabulary object detection. OVIS: open-vocabulary instance segmentation. PointCLIPV2 [zhu2022pointclip]; Uni3D [zhou2023uni3d]; Open3DIS [nguyen2023open3dis]; FM-OV3D [zhang2023fmov3d].
  • Figure 3: Four Categories of Open-Vocabulary 3D Scene Understanding Models. a) 3D feature distillation frameworks, where 2D images are used as a bridge to distil language-aligned features into 3D, with typical works including OpenScene [peng2022openscene] and Clip2Scene [chen2023clip2scene]. b) Building 3D-text pairs, where 2D captioning models are used to build 3D-text pairs for feature learning, with typical works including the PLA-family [ding2023lowis3d, yang2023regionplc, ding2022language]. c) CLIP and Projection, where objects are cropped out of 2D images before being processed by CLIP, and the results are directly projected into 3D, including OpenMask3D [takmaz2023openmask3d], OV-3DET [lu2023open], CLIP$^2$ [zeng2023clip2] and Open3DIS [nguyen2023open3dis]. d) OpenIns3D.
  • Figure 4: General Pipeline of OpenIns3D. OpenIns3D first processes point clouds with MPM to generate 3D mask proposals and mask scores. The Snap module (detailed in Figure \ref{fig: snap}) then renders $N$ synthetic scene-level images, which are passed into the 2D open-world model along with the input text queries. The detection results from the 2D model are stored in the Class Lookup Table (CLT). Finally, both the mask proposals and the CLT are fed into the Lookup module, where Mask2Pixel Guided Lookup (detailed in Figure \ref{fig: mask2point}) is performed at the global level, followed by a Local Enforced Lookup at the local level to assign semantic labels to the mask proposals. A final mask filtering step refines the mask proposals and yields the final results.
  • Figure 5: Snap and Mask2Pixel Maps. Multiscale snaps are conducted to render images with different levels of detail for scene understanding, including wide-corner snaps, wide-angle snaps, and global snaps. Cameras are positioned on the top of the scene and point towards the centre or corners, and the field of view is determined with the calibrated intrinsic matrix. With the defined camera models, Mask2Pixel maps are built to store the location of each 3D mask in the 2D image (using the same colour to represent 2D-3D correspondences) to guide the search for category names.
  • ...and 14 more figures
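
The Figure 5 caption describes cameras placed above the scene and Mask2Pixel maps that record where each 3D mask lands in a rendered image. The sketch below shows one way such a map could be built for a single camera. It is a minimal illustration, not the paper's code: the function name `build_mask2pixel`, a camera looking straight down, a principal point at the image centre, and the absence of occlusion handling are all assumptions.

```python
def build_mask2pixel(points, mask_ids, cam_xy, cam_height, fx, fy, width, height):
    """Project labelled 3D points into a top-down pinhole view.

    points:   list of (x, y, z) world coordinates
    mask_ids: list with one mask id per point
    cam_xy:   (x, y) position of the camera; it looks straight down from cam_height
    fx, fy:   focal lengths in pixels (assumed intrinsics)
    Returns {mask_id: set of (row, col)} for points that land inside the image.
    Occlusion is ignored in this sketch.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # principal point at image centre
    mask2pixel = {}
    for (x, y, z), mask_id in zip(points, mask_ids):
        depth = cam_height - z  # distance along the downward optical axis
        if depth <= 0:
            continue  # point is at or above the camera
        col = int(round(fx * (x - cam_xy[0]) / depth + cx))
        row = int(round(fy * (y - cam_xy[1]) / depth + cy))
        if 0 <= row < height and 0 <= col < width:
            mask2pixel.setdefault(mask_id, set()).add((row, col))
    return mask2pixel

# Toy example: four points belonging to three mask proposals.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 1), (100, 0, 0)]
mask_ids = ["a", "a", "b", "c"]
m2p = build_mask2pixel(points, mask_ids, cam_xy=(0, 0), cam_height=5,
                       fx=10, fy=10, width=11, height=11)
```

In this toy setup the point assigned to mask "c" projects outside the image, so that mask has no entry for this view. A view pointing towards a corner rather than straight down would additionally need a rotation before the projection step.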