Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

Jinlong Li, Cristiano Saltori, Fabio Poiesi, Nicu Sebe

TL;DR

This work addresses open-vocabulary 3D scene understanding by integrating multiple 2D foundation models (e.g., CLIP, DINOv2, Stable Diffusion) into a single 3D backbone via cross-modal distillation. It introduces a deterministic uncertainty estimator that learns per-branch noise levels to adaptively weight diverse 2D feature supervisions, enabling robust fusion of semantic and geometric priors. Empirical results on ScanNetV2 and Matterport3D show competitive OV3D segmentation and strong cross-dataset generalization, with notable gains over prior zero-shot baselines and improved downstream linear probing performance. The approach highlights the potential of aggregating heterogeneous foundation-model cues to form a foundational 3D model, while outlining avenues for further improvement in embedding alignment and backbone design.

Abstract

The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models (such as CLIP, DINOv2, and Stable Diffusion) into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at: https://github.com/TyroneLi/CUA_O3D.
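As a rough illustration of how per-branch uncertainty could weight several distillation losses, the sketch below uses the common log-variance weighting form, in which each branch loss is scaled by `exp(-s_i)` and regularized by `s_i`, with `s_i = log(sigma_i^2)`. This is an assumption for illustration only: the abstract does not give the exact loss, and the cosine distance, the function names, and the example values are all hypothetical rather than taken from the CUA-O3D code.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors (assumed distillation loss)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def uncertainty_weighted_loss(branch_losses, log_vars):
    """Combine per-branch losses using learned log-variances.

    Assumed form: sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i, with s_i = log(sigma_i^2).
    A larger s_i down-weights a noisy branch; the + 0.5 * s_i term keeps
    s_i from growing without bound.
    """
    total = 0.0
    for loss, s in zip(branch_losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

# Hypothetical per-branch distillation losses (e.g. one per 2D teacher).
losses = [
    cosine_distance([1.0, 0.0], [1.0, 0.0]),  # perfectly aligned features
    cosine_distance([1.0, 0.0], [0.0, 1.0]),  # orthogonal features
    cosine_distance([1.0, 1.0], [1.0, 0.0]),  # partially aligned features
]
log_vars = [0.0, 0.0, 0.0]  # in training these would be learnable scalars
total = uncertainty_weighted_loss(losses, log_vars)
```

In an actual training loop the log-variances would be optimized jointly with the 3D backbone, so that branches whose 2D supervision is harder to fit receive a smaller effective weight.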

Paper Structure

This paper contains 21 sections, 11 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Top: feature distribution analysis of the projected 2D feature embeddings from various foundation models (Lseg, DINOv2, and Stable Diffusion), computed over the entire ScanNetV2 training set by counting the frequency of all point features within each bin interval. Bottom: a sample in which K-Means clusters the projected 3D features into a specified number of clusters for segmentation comparison. The foundation models produce heterogeneous yet complementary results.
  • Figure 2: Preliminary study on image embedding ambiguity. VLM embeddings show inconsistent segmentations across multi-view images (e.g., cabinet). Guidance from ambiguous embeddings may be detrimental when supervising 3D model training.
  • Figure 3: Overview of CUA-O3D. We first use the Lseg, DINOv2, and Stable Diffusion models to extract embeddings from multi-view posed images, then apply multi-view 3D projection to obtain the projected 3D features $F^{2D}_{i}$ that supervise the 3D model training. Three MLP layers are set up, each mapping independently to the supervision of one 2D model, while a branch-specific noise scalar $\sigma_{i}$, predicted through deterministic uncertainty estimation, is learned and used to adaptively weight the corresponding distillation loss $\mathcal{L}$.
  • Figure 4: Left side: K-Means is used to cluster the projected 3D feature embeddings from Lseg, DINOv2, and Stable Diffusion, together with our final distilled feature predicted by the 3D model. Right side: UMAP [SMG2020] is applied to project the high-dimensional features into a low-dimensional space to visualize their structural characteristics. White rectangles highlight the apparent heterogeneous yet complementary results.
  • Figure 5: Open-vocabulary 3D semantic segmentation comparisons on ScanNetV2 and Matterport3D. Our approach outperforms OpenScene, which serves as our baseline. Best viewed zoomed in.
  • ...and 5 more figures
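
The multi-view 3D projection mentioned in the Figure 3 caption can be pictured with a minimal NumPy sketch: each 3D point is projected into every posed view, the 2D feature at the resulting pixel is looked up, and the features are averaged over the views in which the point is visible. The function name, argument layout, pinhole camera model, and nearest-pixel lookup are assumptions made for illustration, and the occlusion (depth) test that a real pipeline would need is omitted.

```python
import numpy as np

def project_features(points, feat_maps, intrinsics, world_to_cams):
    """Average per-pixel 2D features onto 3D points across views (hypothetical helper).

    points:        (N, 3) world-space coordinates
    feat_maps:     list of (H, W, C) feature maps, one per view
    intrinsics:    (3, 3) pinhole camera matrix shared by all views (assumption)
    world_to_cams: list of (4, 4) world-to-camera poses, one per view
    Returns (N, C) fused features and an (N,) mask of points seen in any view.
    """
    n = points.shape[0]
    channels = feat_maps[0].shape[2]
    acc = np.zeros((n, channels))
    hits = np.zeros(n)
    homog = np.hstack([points, np.ones((n, 1))])

    for fmap, w2c in zip(feat_maps, world_to_cams):
        h, w, _ = fmap.shape
        cam = (w2c @ homog.T).T[:, :3]           # points in camera coordinates
        z = cam[:, 2]
        in_front = z > 1e-6                      # discard points behind the camera
        safe_z = np.where(in_front, z, 1.0)      # avoid dividing by zero or negatives
        pix = (intrinsics @ cam.T).T
        u = np.round(pix[:, 0] / safe_z).astype(int)
        v = np.round(pix[:, 1] / safe_z).astype(int)
        valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[valid] += fmap[v[valid], u[valid]]   # nearest-pixel feature lookup
        hits[valid] += 1

    seen = hits > 0
    acc[seen] /= hits[seen][:, None]             # mean over the views that saw the point
    return acc, seen

# Toy example: two points, two identical views with an identity pose.
points = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, -1.0]])
fmap = np.zeros((3, 3, 2))
fmap[1, 1] = [5.0, 7.0]
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
feats, seen = project_features(points, [fmap, fmap], K, [np.eye(4), np.eye(4)])
```

Points that no view observes keep a zero feature and are flagged by the returned mask, so a distillation loss could skip them.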