Weakly Supervised 3D Open-vocabulary Segmentation

Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu

TL;DR

This work tackles 3D open-vocabulary segmentation by distilling knowledge from the foundation models CLIP and DINO into a Neural Radiance Field (NeRF) in a weakly supervised manner, without using segmentation annotations. It introduces a 3D Selection Volume to extract pixel-level CLIP features, plus two losses, Relevancy-Distribution Alignment (RDA) and Feature-Distribution Alignment (FDA), that transfer open-vocabulary semantics and boundary cues into 3D space. The approach segments 3D scenes accurately, including long-tail classes, and even outperforms some fully supervised methods in certain scenes, suggesting that 3D open-vocabulary segmentation can be learned effectively from 2D images and text-image pairs. Because CLIP is not finetuned, its open-vocabulary capability is preserved, while the spatial structure distilled from DINO sharpens object boundaries, offering a practical path toward annotation-free 3D scene understanding.

Abstract

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps, but it compromises the open-vocabulary capability because the 2D models are mostly finetuned on closed-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting the pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at https://github.com/Kunhao-Liu/3D-OVS.

Paper Structure (30 sections, 14 equations, 21 figures, 6 tables, 1 algorithm)

Figures (21)

  • Figure 1: Weakly Supervised 3D Open-vocabulary Segmentation. Given the multi-view images of a 3D scene and the open-vocabulary text descriptions, our method distills open-vocabulary multimodal knowledge from CLIP and object reasoning ability from DINO into the reconstructed NeRF, producing accurate object boundaries for the 3D scene without requiring any segmentation annotations during training.
  • Figure 2: Mitigating CLIP features' ambiguities with normalized relevancy maps. In the original relevancy maps $r_a, r_b$ of classes $a$ and $b$, class $b$ has a higher relevancy in Region 2 than in other image regions. Nevertheless, because of the ambiguities of CLIP features, Region 2 is classified as $a$: the absolute relevancy of $a$ in Region 2 is higher than that of $b$, even though $a$ is actually located in Region 1. To rectify this, we normalize each class's relevancy map to a fixed range. The normalized relevancy maps, $\bar{r_a}$ and $\bar{r_b}$, reduce such ambiguities and facilitate accurate region-class assignments.
  • Figure 3: Difference between similar and distant distributions. Distributions with a large divergence from the target distribution vary widely in shape, which increases training instability (left). Conversely, distributions with a low divergence from the target distribution consistently have a similar shape (right).
  • Figure 4: Qualitative comparisons. Visualization of the segmentation results in 3 scenes. Our method successfully recognizes long-tail classes and produces the most accurate segmentation maps.
  • Figure 5: Studies. Visualization of the studies on ablations and limited input.
  • ...and 16 more figures
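
The normalization idea in the Figure 2 caption can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the authors' implementation: the caption says each class's relevancy map is normalized "to a fixed range", and here that is assumed to be a per-class min-max rescaling to [0, 1]. The function names (`normalize_relevancy`, `assign_classes`) and all numeric values are hypothetical and chosen to mirror the two-region situation described in the caption.

```python
def normalize_relevancy(rel):
    """Min-max rescale one class's relevancy values to [0, 1].

    Assumption: the "fixed range" in the caption is taken to be [0, 1].
    """
    r_min, r_max = min(rel), max(rel)
    if r_max == r_min:
        # A flat map carries no information about where the class is.
        return [0.0 for _ in rel]
    return [(r - r_min) / (r_max - r_min) for r in rel]


def assign_classes(relevancy):
    """Assign each region to the class with the highest relevancy there."""
    classes = list(relevancy)
    n_regions = len(relevancy[classes[0]])
    return [
        max(classes, key=lambda c: relevancy[c][i])
        for i in range(n_regions)
    ]


# Hypothetical relevancies for [Region 1, Region 2].
# Class b peaks in Region 2, but class a is larger in absolute terms everywhere.
raw = {"a": [0.9, 0.7], "b": [0.2, 0.6]}
normalized = {c: normalize_relevancy(r) for c, r in raw.items()}

print(assign_classes(raw))         # ['a', 'a']  (Region 2 is mislabeled)
print(assign_classes(normalized))  # ['a', 'b']  (Region 2 goes to class b)
```

With the raw values, class a wins both regions only because its absolute relevancy is larger. After per-class rescaling, each class is compared by where it is relatively strongest, so Region 2 is assigned to class b.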