Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation

Pengfei Wang, Yuxi Wang, Shuai Li, Zhaoxiang Zhang, Zhen Lei, Lei Zhang

TL;DR

Open vocabulary 3D scene understanding is hampered by the scarcity of paired 3D-text data. The authors introduce Geometry Guided Self-Distillation (GGSD), a two-stage framework that first distills knowledge from 2D pre-trained open-vocabulary models into a 3D network through geometry guided distillation, and then distills knowledge further within the 3D network through geometry guided self-distillation. 3D geometric priors (superpoints) constrain the distillation, and the self-distillation stage relies on EMA-based self-labeling with voting. The authors report state-of-the-art open vocabulary performance on indoor and outdoor datasets, including cross-domain scenarios. The results suggest that combining geometry-aware supervision with 3D self-distillation lets the 3D student surpass its 2D teacher and supports robust open vocabulary 3D scene understanding.
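The two ingredients named above, an EMA teacher and voting over superpoints, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `ema_update` and `superpoint_vote`, the momentum value, and the choice of a simple majority vote are all assumptions made for the example.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Exponential moving average: teacher <- m * teacher + (1 - m) * student.

    `teacher` and `student` are dicts of NumPy arrays with matching keys
    (a hypothetical stand-in for network weights).
    """
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}

def superpoint_vote(point_labels, superpoint_ids, num_classes):
    """Assign every point the majority label of its superpoint.

    point_labels:   (N,) non-negative integer pseudo-labels from the teacher
    superpoint_ids: (N,) integer superpoint index for each point
    """
    voted = np.empty_like(point_labels)
    for sp in np.unique(superpoint_ids):
        mask = superpoint_ids == sp
        counts = np.bincount(point_labels[mask], minlength=num_classes)
        voted[mask] = counts.argmax()  # ties resolve to the lowest class id
    return voted
```

In this sketch a single mislabeled point inside a superpoint is overruled by its neighbors, which is the intuition behind using geometric priors to clean up self-generated labels.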

Abstract

The scarcity of large-scale 3D-text paired data poses a great challenge to open vocabulary 3D scene understanding, and hence it is popular to leverage internet-scale 2D data and transfer its open vocabulary capabilities to 3D models through knowledge distillation. However, existing distillation-based 3D scene understanding approaches rely on the representation capacity of 2D models and disregard the geometric priors and inherent representational advantages offered by 3D data. In this paper, we propose an effective approach, namely Geometry Guided Self-Distillation (GGSD), to learn superior 3D representations from 2D pre-trained models. Specifically, we first design a geometry guided distillation module to distill knowledge from 2D models, and then leverage the 3D geometric priors to alleviate the inherent noise in 2D models and enhance the representation learning process. Due to the advantages of 3D representation, the performance of the distilled 3D student model can significantly surpass that of the 2D teacher model. This motivates us to further leverage the representation advantages of 3D data through self-distillation. As a result, our proposed GGSD approach outperforms existing open vocabulary 3D scene understanding methods by a large margin, as demonstrated by our experiments on both indoor and outdoor benchmark datasets.
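One way to picture how geometric priors can dampen noise in 2D features is to pool features over each superpoint before comparing the 3D student with the 2D teacher. The sketch below assumes mean pooling and a cosine-distance loss; the abstract does not state the exact loss, so both choices, along with the names `superpoint_mean` and `cosine_distill_loss`, are illustrative assumptions.

```python
import numpy as np

def superpoint_mean(features, superpoint_ids):
    """Average per-point features within each superpoint; returns (S, D)."""
    ids, inverse = np.unique(superpoint_ids, return_inverse=True)
    inverse = inverse.ravel()
    sums = np.zeros((len(ids), features.shape[1]))
    np.add.at(sums, inverse, features)          # scatter-add rows by superpoint
    counts = np.bincount(inverse, minlength=len(ids))[:, None]
    return sums / counts

def cosine_distill_loss(feat_3d, feat_2d, superpoint_ids, eps=1e-8):
    """Mean cosine distance between superpoint-pooled 3D and 2D features.

    feat_3d: (N, D) features from the 3D student (assumed layout)
    feat_2d: (N, D) 2D teacher features projected onto the same N points
    """
    a = superpoint_mean(feat_3d, superpoint_ids)
    b = superpoint_mean(feat_2d, superpoint_ids)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))
```

Because the loss is computed on pooled features, a few noisy pixel-to-point projections inside a superpoint have limited influence on the target the student is asked to match.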

Paper Structure (13 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Example of the problems that arise when using 2D models, e.g., LSeg (Li et al., 2022), for pixel-to-point distillation. The 2D predictions contain noise caused by factors such as occlusion, lighting variations, and viewing angle. Distilling from the 2D model alone therefore passes these errors on to the 3D model and limits its 3D representation capacity.
  • Figure 2: Overview of our Geometry Guided Self-Distillation (GGSD) framework, which consists of two main components: Geometry Guided Distillation and Self-Distillation. The first module leverages 3D geometric priors to mitigate the inherent noise in 2D models. In the second module, the 3D network learns from its own predictions. This self-distillation step allows the 3D network to continuously enhance its understanding and representation capabilities. Overall, the GGSD framework combines geometry guidance and self-distillation to fully exploit the valuable information hidden in 3D data, rather than solely relying on 2D pre-trained models.
  • Figure 3: Example of superpoints generated using VCCS (Papon et al., 2013) on ScanNet scene point clouds. Each colored patch corresponds to a distinct superpoint with a simple geometric structure. The points within each superpoint also consistently share a single semantic category.
  • Figure 4: Qualitative results. We present the qualitative results of 3D semantic segmentation on public indoor and outdoor benchmarks.
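
The observation in the Figure 3 caption, that points within a superpoint share a semantic category, can be checked numerically on any labeled scene. The diagnostic below is a hypothetical helper, not something described in the paper: it reports the fraction of points whose label matches the majority label of their superpoint.

```python
import numpy as np

def superpoint_purity(labels, superpoint_ids):
    """Fraction of points that agree with their superpoint's majority label.

    labels:         (N,) non-negative integer semantic labels
    superpoint_ids: (N,) integer superpoint index for each point
    """
    agree = 0
    for sp in np.unique(superpoint_ids):
        sp_labels = labels[superpoint_ids == sp]
        agree += np.bincount(sp_labels).max()  # size of the majority class
    return agree / len(labels)
```

A purity close to 1.0 would indicate that superpoints rarely straddle semantic boundaries, which is the property that makes them useful as a prior for pooling and voting.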