Weakly Supervised 3D Open-vocabulary Segmentation
Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu
TL;DR
This work tackles open-vocabulary 3D segmentation by weakly distilling knowledge from foundation models CLIP and DINO into a Neural Radiance Field (NeRF) without using segmentation annotations. It introduces a 3D Selection Volume to extract pixel-level CLIP features, plus two losses—Relevancy-Distribution Alignment (RDA) and Feature-Distribution Alignment (FDA)—to align open-vocabulary semantics and boundary cues to 3D space. The approach yields accurate 3D segmentations with long-tail classes and even surpasses some fully supervised methods in certain scenes, demonstrating that 3D open-vocabulary segmentation can leverage 2D image-text data effectively. The method preserves open-vocabulary capabilities by avoiding CLIP finetuning and distills DINO-derived spatial structure to improve object boundaries, offering a practical path toward annotation-free 3D scene understanding.
Abstract
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at \url{https://github.com/Kunhao-Liu/3D-OVS}.
