3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving

Boyi Sun, Yuhang Liu, Xingxia Wang, Bin Tian, Long Chen, Fei-Yue Wang

TL;DR

This work tackles the high cost of 3D data labeling for autonomous driving by introducing AFOV, a two-stage annotation-free framework that distills knowledge from high-quality 2D open-vocabulary segmentation models into a 3D backbone. The first stage, Tri-Modal Contrastive Pre-training (TMP), synchronously generates and aligns masks, image features, and text embeddings to embed semantic understanding into the 3D representation without relying on 2D backbones during pre-training. The second stage uses pseudo-label guided knowledge distillation for 3D learning. To combat unobserved regions and label noise, Approximate Flat Interaction (AFI) provides a robust, non-parametric correction mechanism that leverages approximate planes and directional correlations, and a superpixel-superpoint construction ties 2D segmentation granularity to 3D point clouds. Empirically, AFOV achieves state-of-the-art annotation-free 3D segmentation on nuScenes with 47.73% mIoU, and strong downstream transfer with 1% data fine-tuning (51.75% mIoU) and 100% linear probing (56.35% mIoU), demonstrating the viability and practicality of 2D-to-3D knowledge distillation for reducing annotation costs in real-world autonomous-driving pipelines.
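All results above are reported as mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch in plain Python. The function name `mean_iou`, the `ignore_index` convention, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration; the official nuScenes evaluation protocol may differ in such details.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int, ignore_index: int = -1) -> float:
    """Average per-class IoU over the classes that occur at least once."""
    intersection: List[int] = [0] * num_classes
    union: List[int] = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled points do not count toward the metric
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1  # missed point of class t
            if 0 <= p < num_classes:
                union[p] += 1  # false positive for class p
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.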

Abstract

Point cloud data labeling is considered a time-consuming and expensive task in autonomous driving, whereas annotation-free training can avoid it by learning point cloud representations from unannotated data. In this paper, we propose AFOV, a novel 3D Annotation-Free framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages. In the first stage, we integrate high-quality textual and image features of 2D open-vocabulary models and propose Tri-Modal Contrastive Pre-training (TMP). In the second stage, the spatial mapping between point clouds and images is used to generate pseudo-labels, enabling cross-modal knowledge distillation. In addition, we introduce Approximate Flat Interaction (AFI) to address noise during alignment and label confusion. To validate AFOV, extensive experiments are conducted on multiple related datasets. We achieve a record 47.73% mIoU on the annotation-free 3D segmentation task on nuScenes, surpassing the previous best model by 3.13% mIoU. Meanwhile, fine-tuning with 1% of the data on nuScenes and SemanticKITTI reaches 51.75% mIoU and 48.14% mIoU, respectively, outperforming all previous pre-trained models.
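The second stage relies on a spatial mapping between point clouds and images to transfer 2D mask labels to 3D points. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the points are already expressed in the camera frame, a pinhole camera with a 3x3 intrinsics matrix, and a per-pixel label map produced by some 2D segmentation model. The names `project_to_pixel`, `pseudo_labels_from_mask`, and `IGNORE_LABEL` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

IGNORE_LABEL = -1  # hypothetical marker for points without 2D supervision


def project_to_pixel(point_cam: Sequence[float],
                     intrinsics: Sequence[Sequence[float]]
                     ) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-frame point; None if behind the camera."""
    x, y, z = point_cam
    if z <= 0.0:
        return None
    fx, fy = intrinsics[0][0], intrinsics[1][1]
    cx, cy = intrinsics[0][2], intrinsics[1][2]
    return fx * x / z + cx, fy * y / z + cy


def pseudo_labels_from_mask(points_cam: Sequence[Sequence[float]],
                            intrinsics: Sequence[Sequence[float]],
                            label_map: Sequence[Sequence[int]]) -> List[int]:
    """Assign each 3D point the 2D label of the pixel it projects onto."""
    height, width = len(label_map), len(label_map[0])
    labels: List[int] = []
    for point in points_cam:
        pixel = project_to_pixel(point, intrinsics)
        if pixel is None:
            labels.append(IGNORE_LABEL)
            continue
        # floor (not int truncation) so slightly negative coordinates stay out of bounds
        col, row = math.floor(pixel[0]), math.floor(pixel[1])
        inside = 0 <= col < width and 0 <= row < height
        labels.append(label_map[row][col] if inside else IGNORE_LABEL)
    return labels
```

Points that fall outside the image or behind the camera receive no label in this sketch. A real pipeline would also apply the LiDAR-to-camera extrinsics and handle multiple cameras, which are omitted here.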

Paper Structure

This paper contains 36 sections, 18 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Segmentation results of AFOV annotation-free training. More illustrations are presented in Appendix D.
  • Figure 2: Performance of AFOV on nuScenes.
  • Figure 3: Overview of AFOV, which consists of two stages: Tri-Modal Pre-training (TMP) and Annotation-free training (AFOV-baseline). Both stages leverage masks and mask labels extracted from 2D open-vocabulary segmentation models, while mask features and text features are employed only in TMP. TMP enhances scene understanding through contrastive losses: superpixel-superpoint loss $\mathcal{L}_{I-P}$ and text-superpoint loss $\mathcal{L}_{T-P}$, while our baseline employs pseudo-labels to supervise the 3D network. Additionally, to bridge dataset classes and open vocabularies, we introduce a class dictionary. The Approximate Flat Interaction (AFI) optimizes the results by spatial structural analysis in a broad perception domain.
  • Figure 4: Illustrating two examples of potential "self-conflicts" based on SAM segmentation.
  • Figure 5: Illustration of image segmentation results of various 2D open-vocabulary segmentation models. We observe that MaskCLIP (pixel-level CLIP) exhibits label confusion and high error rates in semantic segmentation. The output of SEEM not only suffers from missing masks but also contains incorrect mask annotations. More results are provided in Appendix D.
  • ...and 3 more figures
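
The Figure 3 caption describes two contrastive losses in TMP, a superpixel-superpoint loss $\mathcal{L}_{I-P}$ and a text-superpoint loss $\mathcal{L}_{T-P}$, without giving their exact form in this summary. The sketch below shows a generic InfoNCE-style contrastive loss of the kind commonly used for such cross-modal alignment. It is an assumption that the paper's losses resemble this form. The function name `info_nce`, the temperature value, and the one-to-one pairing of rows are all illustrative choices.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(point_feats: Sequence[Sequence[float]],
             target_feats: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Generic InfoNCE loss: row i of point_feats is the positive for row i of target_feats.

    target_feats could hold pooled superpixel features or text embeddings;
    all other rows act as negatives.
    """
    if not point_feats or len(point_feats) != len(target_feats):
        raise ValueError("need the same non-zero number of point and target features")
    points = [l2_normalize(v) for v in point_feats]
    targets = [l2_normalize(v) for v in target_feats]
    total = 0.0
    for i, p in enumerate(points):
        # cosine similarity to every target, scaled by the temperature
        logits = [sum(a * b for a, b in zip(p, t)) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(points)
```

With matched features the loss is close to zero, and it grows when each superpoint is more similar to another row's target than to its own. For a text-superpoint variant, several superpoints can share one class embedding, so a practical version would need to avoid treating same-class rows as negatives; that refinement is left out of this sketch.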