Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang

TL;DR

This work introduces Dense Multimodal Alignment (DMA), a framework that densely co-embeds 3D points, 2D image pixels, and 1D text in a shared space to enable open-vocabulary 3D scene understanding. The text modality is built with a tagging model (RAM) and a multimodal LLM (LLaVA) to provide complete category coverage and scalable scene descriptions, while 2D features are enhanced via a dual-path FC-CLIP approach with a frozen visual encoder and a trainable mask head. The image serves as a bridge to create dense point-to-text correspondences, and a mutually inclusive loss then aligns the modalities in a joint embedding space. Experiments on ScanNet, Matterport3D, and nuScenes demonstrate competitive open-vocabulary segmentation performance in both indoor and outdoor settings. The approach leverages vision-language foundation models to maximize cross-modal supervision while preserving open-vocabulary capabilities.

Abstract

Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space for maximizing their synergistic benefits. Instead of extracting coarse view- or region-level text prompts, we leverage large vision-language models to extract complete category information and scalable scene descriptions to build the text modality, and take image modality as the bridge to build dense point-pixel-text associations. Besides, in order to enhance the generalization ability of the 2D model for downstream 3D tasks without compromising the open-vocabulary capability, we employ a dual-path integration approach to combine frozen CLIP visual features and learnable mask features. Extensive experiments show that our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
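
The dense point-pixel-text association described above can be illustrated with a minimal sketch. It assumes that per-pixel and per-text features already live in a shared embedding space and that a point-to-pixel projection is given. The function names, array shapes, and the hard argmax labeling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pixel_text_scores(pixel_feats, text_feats):
    """Cosine similarity between every pixel and every text embedding.

    pixel_feats: (H, W, D) array; text_feats: (K, D) array.
    Returns a semantic score map of shape (H, W, K).
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return p @ t.T

def backproject_labels(score_map, point_pixels):
    """Give each 3D point the text label of the pixel it projects to.

    point_pixels: (N, 2) integer (row, col) per point. A negative row marks
    a point that is not visible in this view (hypothetical convention).
    Returns (N,) label indices, with -1 for invisible points.
    """
    label_map = score_map.argmax(axis=-1)            # (H, W) best text per pixel
    labels = np.full(len(point_pixels), -1, dtype=np.int64)
    visible = point_pixels[:, 0] >= 0
    rows, cols = point_pixels[visible].T
    labels[visible] = label_map[rows, cols]
    return labels

# Toy example: a 1x2 image, two text embeddings, three points.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
pixels = np.array([[[2.0, 0.1], [0.1, 3.0]]])
points = np.array([[0, 0], [0, 1], [-1, -1]])
labels = backproject_labels(pixel_text_scores(pixels, text), points)
```

Here the image acts as the bridge: text labels are first attached to pixels by feature similarity and then carried over to the points that project onto those pixels.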

Paper Structure (17 sections, 7 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Framework of our proposed Dense Multimodal Alignment (DMA) method. We generate comprehensive language modality data by leveraging a tagging model and an MLLM. For the 2D modality, we freeze the CLIP visual backbone ${\bf f}^{2D}_{clip}$ but fine-tune the mask head ${\bf f}^{2D}_{mask}$ for better adaptation to downstream 3D tasks without compromising the open-vocabulary ability. Dense correspondences between pixels ${\bf f}^{2D}$ and texts ${\bf f}^{T}_{tag}$/${\bf f}^{T}_{llm}$ are then built by computing their feature similarities, resulting in semantic score maps $S^{2D}_{tag}$/$S^{2D}_{llm}$. Taking the image modality as the bridge, we back-project text labels to each point and obtain the 3D label maps $M^{3D}_{tag}$/$M^{3D}_{llm}$. Finally, we co-embed point ${\bf f}^{3D}$, pixel ${\bf f}^{2D}$, and text ${\bf f}^{T}$ embeddings into a common space to learn a robust 3D representation by optimizing the mutually inclusive loss function.
  • Figure 2: Scene tagging generation. (1) We first employ RAM (Zhang et al., 2023) to generate view-level tags, and then (2) reduce the tag noise with GPT. Finally, scene-level tags are generated by (3) multi-view voting.
  • Figure 3: Segmentation results using 2D and 3D models. The 2D model has an advantage in segmenting background objects (blue boxes), while the 3D model is better suited to foreground objects with distinct structures (red boxes).
  • Figure 4: Qualitative results of different methods on both indoor and outdoor datasets.
  • Figure 5: Comparisons of text, 2D, and 3D features. "F" and "B" denote foreground and background classes.
  • ...and 7 more figures
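
Step (3) of the scene tagging pipeline in Figure 2, multi-view voting, can be sketched as follows. The source does not specify the voting rule, so the function name and the `min_ratio` threshold are hypothetical and chosen only for illustration. Steps (1) and (2), RAM tagging and GPT denoising, are assumed to have already produced the per-view tag lists.

```python
from collections import Counter

def vote_scene_tags(view_tags, min_ratio=0.5):
    """Keep tags that appear in at least `min_ratio` of the views.

    view_tags: list of per-view tag lists. Each view votes at most once
    per tag, so repeated tags within a view are not double-counted.
    """
    if not view_tags:
        return []
    counts = Counter(tag for tags in view_tags for tag in set(tags))
    threshold = min_ratio * len(view_tags)
    return sorted(tag for tag, n in counts.items() if n >= threshold)

# Toy example: three views of the same scene.
views = [["chair", "table"], ["chair", "lamp"], ["chair", "table", "table"]]
scene_tags = vote_scene_tags(views)
```

In this toy case "lamp" is seen in only one of three views and is dropped, which shows how voting across views can suppress tags that a single view produced spuriously.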