ELiTe: Efficient Image-to-LiDAR Knowledge Transfer for Semantic Segmentation
Zhibo Zhang, Ximing Yang, Weizhong Zhang, Cheng Jin
TL;DR
ELiTe tackles the weak-teacher problem in cross-modal LiDAR semantic segmentation by transferring rich image priors from the Segment Anything Model (SAM) to a lightweight LiDAR student through Patch-to-Point Multi-Stage Knowledge Distillation (PPMSKD). It fine-tunes the Vision Foundation Model (VFM) teacher with Parameter-Efficient Fine-Tuning (PEFT) and uses SAM-based Pseudo-Label Generation (SAM-PLG) to provide dense, high-quality supervision despite sparse ground truth. The approach delivers state-of-the-art performance among real-time models on SemanticKITTI with a small parameter budget (1.9M for the student) and real-time inference (24 Hz), while keeping training efficient. The findings indicate that combining diverse open-world image priors with efficient cross-modal distillation can yield robust, efficient LiDAR segmentation suitable for real-world deployment.
Abstract
Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the "weak teacher challenge" arises because vehicle camera images are repetitive and non-diverse and ground truth labels are sparse and inaccurate. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, which transfers comprehensive knowledge from a Vision Foundation Model (VFM) that has been extensively trained on diverse open-world images. This enables effective knowledge transfer across modalities to a lightweight student model. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and to expedite large-scale model training at minimal cost. Additionally, we introduce the Segment Anything Model based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.
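
To make the idea of patch-to-point transfer concrete, the following is a minimal single-stage sketch and not the authors' implementation. It assumes that each LiDAR point has already been projected to image pixel coordinates (u, v), that the image teacher emits one vector of class logits per square patch, and that the distillation signal is a KL divergence between teacher and student class distributions. The function names, the patch size of 16, and the choice of KL divergence are all illustrative assumptions; the abstract specifies none of them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def patch_index(u, v, patch_size, grid_w, grid_h):
    """Map pixel coordinates (u = column, v = row) to a (row, col) patch index.

    Coordinates outside the image are clamped to the nearest border patch
    (an assumption; a real pipeline might discard such points instead).
    """
    col = min(max(int(u // patch_size), 0), grid_w - 1)
    row = min(max(int(v // patch_size), 0), grid_h - 1)
    return row, col


def patch_to_point_kl(student_logits, teacher_patch_logits, pixel_uv, patch_size=16):
    """Mean KL(teacher || student) over points.

    student_logits:       one list of class logits per point
    teacher_patch_logits: grid [row][col] of class-logit lists, one per patch
    pixel_uv:             one (u, v) pixel coordinate per point
    """
    grid_h = len(teacher_patch_logits)
    grid_w = len(teacher_patch_logits[0])
    total = 0.0
    for s_logits, (u, v) in zip(student_logits, pixel_uv):
        row, col = patch_index(u, v, patch_size, grid_w, grid_h)
        t = softmax(teacher_patch_logits[row][col])
        s = softmax(s_logits)
        # Guard against log(0) if a probability underflows.
        total += sum(
            ti * (math.log(ti) - math.log(max(si, 1e-12)))
            for ti, si in zip(t, s)
            if ti > 0
        )
    return total / len(student_logits)
```

In this sketch the loss is zero when every point's prediction matches the patch it falls in and grows as the two distributions diverge. A multi-stage variant would, under the same assumptions, apply a comparable term to intermediate features at several network depths.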
