UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Qingdong He; Jinlong Peng; Zhengkai Jiang; Kai Wu; Xiaozhong Ji; Jiangning Zhang; Yabiao Wang; Chengjie Wang; Mingang Chen; Yunsheng Wu

TL;DR

This work tackles open-vocabulary 3D scene understanding by introducing UniM-OV3D, a unified framework that jointly aligns 3D point clouds with image, depth, and text. Central innovations include a trainable hierarchical point-cloud feature extractor for fine-grained geometry and a point-semantic caption learning mechanism that generates global, eye-view, and sector-view captions to provide coarse-to-fine language supervision. Dense cross-modal contrastive losses across (point, image, depth, text) along with the caption losses enable robust multi-modal embedding alignment, aided by a learnable depth encoder and a fixed image/text backbone. Extensive experiments across indoor and outdoor datasets (ScanNet, ScanNet200, S3DIS, nuScenes) demonstrate state-of-the-art open-vocabulary semantic and instance segmentation, validating the effectiveness and generalizability of dense four-modal fusion for scalable 3D understanding.

Abstract

3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, which aligns point clouds with image, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns comprehensive fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes. Code is available at https://github.com/hithqd/UniM-OV3D.
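
The alignment of point clouds with image, language and depth can be illustrated with a small contrastive-learning sketch. The summary above does not give the exact loss, so the code below assumes a symmetric InfoNCE objective summed over every pair of modalities; the function names (`info_nce`, `dense_alignment_loss`), the temperature of 0.07 and the NumPy implementation are all illustrative assumptions, not the authors' code.

```python
import itertools
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two (N, D) batches; row i of a pairs with row i of b."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


def dense_alignment_loss(embeddings, temperature=0.07):
    """Sum the pairwise loss over all modality pairs (assumed form, not the paper's)."""
    pairs = itertools.combinations(sorted(embeddings), 2)
    return sum(info_nce(embeddings[m], embeddings[n], temperature) for m, n in pairs)


# Toy batch: four samples whose embeddings agree across all four modalities.
base = np.eye(4)
aligned = {name: base.copy() for name in ("point", "image", "depth", "text")}
# Shift the point embeddings by one row so they no longer match the other modalities.
misaligned = dict(aligned, point=np.roll(base, 1, axis=0))

aligned_loss = dense_alignment_loss(aligned)
misaligned_loss = dense_alignment_loss(misaligned)
```

With four modalities the sum runs over six pairs, so every modality is pulled toward every other one rather than only toward the point-cloud branch. In this toy batch the aligned embeddings give a much lower loss than the shifted ones.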

Paper Structure (32 sections, 11 equations, 5 figures, 10 tables)

This paper contains 32 sections, 11 equations, 5 figures, 10 tables.

Introduction
Related Work
3D Recognition
Open-Vocabulary 3D Scene Understanding
Method
Preliminary
Hierarchical Feature Extractor with Local and Global Fusion
Spatial-aware Layers.
Point-semantic Caption Learning
Global-view Point-semantic Caption.
Hierarchical Point-semantic Caption.
Contrastive Point-semantic Caption Training.
Aligning Multimodal Representations
Dense Associations across Modalities.
Experiments
...and 17 more sections

Figures (5)

Figure 1: Open-vocabulary 3D scene understanding. Different colors represent the confidence of matching the user-specified query. (a) Comparison of different methods using the same query, (b) Results of our method in complex reasoning or content that requires extensive world knowledge.
Figure 2: Architecture of our proposed UniM-OV3D. The input point clouds are processed by a hierarchical point cloud extraction module to fuse the local and global features. To provide a coarse-to-fine text supervision signal, the point-semantic caption learning is designed to acquire representations from various 3D viewpoints. The overall framework takes point clouds, 2D images, text and depth maps as input to establish a unified multimodal contrastive learning scheme for open-vocabulary 3D scene understanding.
Figure 3: Point-caption pairs comparison. The middle column represents the captions generated by the point-captioning model, PointLLM [xu2023pointllm], and the third column represents the corresponding image captions.
Figure 4: Comparisons on open-vocabulary 3D instance segmentation. (a) B13/N4 on ScanNet, (b) B8/N4 on S3DIS, (c) B12/N3 on nuScenes and (d) B170/N30 on ScanNet200.
Figure A1: Qualitative results comparison between our proposed UniM-OV3D and the state-of-the-art OpenIns3D [huang2023openins3d] for semantic segmentation on the ScanNet dataset.
