3D Annotation-Free Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving
Boyi Sun, Yuhang Liu, Xingxia Wang, Bin Tian, Long Chen, Fei-Yue Wang
TL;DR
This work tackles the high cost of 3D data labeling for autonomous driving by introducing AFOV, a two-stage annotation-free framework that distills knowledge from high-quality 2D open-vocabulary segmentation models into a 3D backbone. The first stage, Tri-Modal Contrastive Pre-training (TMP), synchronously generates and aligns masks, image features, and text embeddings to embed semantic understanding into the 3D representation without relying on 2D backbones during pre-training. The second stage uses pseudo-label guided knowledge distillation for 3D learning. To handle unobserved regions and label noise, Approximate Flat Interaction (AFI) provides a robust, non-parametric correction mechanism that leverages approximate planes and directional correlations, and a superpixel-superpoint construction ties 2D segmentation granularity to 3D point clouds. Empirically, AFOV achieves state-of-the-art annotation-free 3D segmentation on nuScenes with 47.73\% mIoU, and transfers well to downstream tasks, reaching 51.75\% mIoU when fine-tuned on 1\% of the data and 56.35\% mIoU under linear probing on 100\% of the data. These results indicate that 2D-to-3D knowledge distillation is a practical way to reduce annotation costs in real-world autonomous-driving pipelines.
Abstract
Point cloud data labeling is a time-consuming and expensive task in autonomous driving, whereas annotation-free learning avoids it by learning point cloud representations from unannotated data. In this paper, we propose AFOV, a novel 3D \textbf{A}nnotation-\textbf{F}ree framework assisted by 2D \textbf{O}pen-\textbf{V}ocabulary segmentation models. It consists of two stages. In the first stage, we integrate high-quality textual and image features of 2D open-vocabulary models and propose Tri-Modal Contrastive Pre-training (TMP). In the second stage, the spatial mapping between point clouds and images is used to generate pseudo-labels, enabling cross-modal knowledge distillation. In addition, we introduce Approximate Flat Interaction (AFI) to address noise during alignment and label confusion. To validate AFOV, we conduct extensive experiments on multiple related datasets. We achieve a record 47.73\% mIoU on the annotation-free 3D segmentation task on nuScenes, surpassing the previous best model by 3.13\% mIoU. Meanwhile, fine-tuning with 1\% of the data on nuScenes and SemanticKITTI reaches 51.75\% mIoU and 48.14\% mIoU, respectively, outperforming all previous pre-trained models.
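
The second stage relies on a spatial mapping between point clouds and images to produce pseudo-labels. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes a single pinhole camera, a known LiDAR-to-camera extrinsic matrix, and a per-pixel class map already produced by some 2D segmentation model. The function name `project_pseudo_labels` and all parameter names are hypothetical, and the sketch includes neither occlusion handling nor the AFI correction.

```python
import numpy as np


def project_pseudo_labels(points_lidar, T_cam_from_lidar, K, seg_mask, ignore_index=-1):
    """Give each LiDAR point the 2D label of the pixel it projects onto.

    points_lidar:      (N, 3) array of xyz coordinates in the LiDAR frame.
    T_cam_from_lidar:  (4, 4) homogeneous transform, LiDAR frame -> camera frame (assumed known).
    K:                 (3, 3) pinhole camera intrinsics (assumed known).
    seg_mask:          (H, W) integer class map from a 2D segmentation model.
    Points behind the camera or outside the image keep `ignore_index`.
    """
    n = points_lidar.shape[0]
    labels = np.full(n, ignore_index, dtype=np.int64)

    # Move points into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar[:, :3], np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Only points with positive depth can be seen by the camera.
    in_front = pts_cam[:, 2] > 1e-6

    # Pinhole projection to pixel coordinates (u = column, v = row).
    uvw = (K @ pts_cam[in_front].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Keep projections that land inside the image bounds.
    h, w = seg_mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Copy the 2D label onto the corresponding 3D point.
    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = seg_mask[v[inside], u[inside]]
    return labels
```

Points that receive `ignore_index` correspond to regions the camera does not observe; in a multi-camera setup, the same routine could be run once per camera and the results merged.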
