ELiTe: Efficient Image-to-LiDAR Knowledge Transfer for Semantic Segmentation

Zhibo Zhang, Ximing Yang, Weizhong Zhang, Cheng Jin

TL;DR

ELiTe tackles the weak-teacher problem in cross-modal LiDAR semantic segmentation by transferring rich image priors from the Segment Anything Model (SAM) to a lightweight LiDAR student through Patch-to-Point Multi-Stage Knowledge Distillation (PPMSKD). It adapts the Vision Foundation Model teacher with Parameter-Efficient Fine-Tuning (PEFT) and uses SAM-based Pseudo-Label Generation (SAM-PLG) to provide dense, high-quality supervision despite sparse ground truth. The approach delivers state-of-the-art performance on SemanticKITTI with a small parameter budget (1.9M for the student) and real-time inference (24 Hz), while keeping training efficient. The findings indicate that diverse open-world image priors, combined with efficient cross-modal distillation, can yield robust and efficient LiDAR segmentation suitable for real-world deployment.

Abstract

Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the "weak teacher challenge" arises due to repetitive and non-diverse car camera images and sparse, inaccurate ground truth labels. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive knowledge from the Vision Foundation Model (VFM), which is extensively trained on diverse open-world images. This enables effective knowledge transfer to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and expedite large-scale model training at minimal cost. Additionally, we introduce the Segment Anything Model-based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.
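
To make the patch-to-point idea concrete, here is a minimal NumPy sketch of one plausible pairing step: each LiDAR point that projects into the image is matched with the feature of the image patch it lands in, and the student feature is pulled toward that teacher feature. This is an illustration under stated assumptions, not the paper's implementation. The function names, the square patch grid, and the normalised squared-error loss are all hypothetical choices; the abstract does not specify the actual loss or the multi-stage details.

```python
import numpy as np

def gather_patch_features(patch_feats, uv, patch_size):
    """For every projected point, pick the teacher feature of the patch it falls into.

    patch_feats: (Hp, Wp, C) teacher features, one vector per image patch.
    uv:          (N, 2) pixel coordinates (u = column, v = row) of projected points,
                 assumed to lie inside the image already.
    patch_size:  side length of a square patch in pixels (assumed layout).
    """
    cols = (uv[:, 0] // patch_size).astype(int)
    rows = (uv[:, 1] // patch_size).astype(int)
    return patch_feats[rows, cols]  # (N, C)

def distillation_loss(student_feats, teacher_feats):
    """Mean squared distance between L2-normalised student and teacher features.

    This loss is an assumption chosen for simplicity, not the paper's objective.
    """
    def normalise(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    diff = normalise(student_feats) - normalise(teacher_feats)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy example: a 32x48 image split into 2x3 patches of 16 pixels, 4 channels each.
patch_feats = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
uv = np.array([[5.0, 3.0], [40.0, 20.0]])  # lands in patches (row 0, col 0) and (row 1, col 2)
teacher = gather_patch_features(patch_feats, uv, patch_size=16)
loss = distillation_loss(teacher.copy(), teacher)  # identical features give zero loss
```

In a multi-stage setting the same pairing would be repeated at several feature resolutions, with `patch_size` changing per stage; that extension is left out here because the source gives no stage-level details.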

Paper Structure

This paper contains 32 sections, 3 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Car Camera Images and Open World Images.
  • Figure 2: Existing one-stage Point-to-Pixel Correspondence-based Ground Truth Generation (PPC-GTG) and our two-stage SAM-based Pseudo-Label Generation (SAM-PLG). The sparse mask labels generated by one-stage PPC-GTG are incomplete, inaccurate, and low-quality, whereas the dense mask labels generated by two-stage SAM-PLG are more accurate, more complete, and of higher quality.
  • Figure 3: Framework Overview. The framework comprises three main components: the VFM teacher, the lightweight student, and the Patch-to-Point Multi-Stage Knowledge Distillation (PPMSKD) network. The teacher and student networks process image and LiDAR inputs, respectively, extracting multi-stage features. In the PPMSKD network, knowledge from the teacher is transferred to the student. The VFM teacher network undergoes domain-adaptive fine-tuning via PEFT and is supervised by pseudo-labels generated by SAM-PLG. In this figure, TB(WA) and TB(GA) denote Transformer Blocks employing window and global attention, respectively, "Patch Dec." signifies the Patch Decoder, and $\odot$ represents concatenation. Solid lines delineate the data flow, while dashed lines represent the backpropagation supervisory signal.
  • Figure 4: t-SNE Visualization. In (b) and (c), the clusters on the left are from the image teacher, and the clusters on the right are from the LiDAR student. The features are taken at the intersection of points and pixels in a single-frame segmentation, so each group of clusters has the same number of feature points.
  • Figure 5: Visualization of images and their Pseudo-Labels.
  • ...and 1 more figure
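
The Figure 2 caption contrasts sparse labels obtained from point-to-pixel correspondence with dense SAM-based pseudo-labels. The sketch below illustrates only the first, one-stage idea, and only to show why the resulting label map is sparse: labelled LiDAR points are projected into the image and each one marks a single pixel. It is a simplified illustration under assumptions, not the paper's PPC-GTG code. The pinhole projection matrix, the `IGNORE` value, and the function name are hypothetical, and the SAM-based second stage is omitted because the source does not describe its details.

```python
import numpy as np

IGNORE = 255  # assumed marker for unlabelled pixels, not taken from the paper

def project_labels(points, labels, proj, height, width):
    """Scatter per-point labels into an image-sized map through a pinhole projection.

    points: (N, 3) points already expressed in the camera frame.
    labels: (N,) integer class ids.
    proj:   (3, 4) camera projection matrix (hypothetical values in the demo below).
    """
    homo = np.hstack([points, np.ones((len(points), 1))])  # (N, 4) homogeneous coordinates
    cam = homo @ proj.T                                     # (N, 3) projected coordinates
    in_front = cam[:, 2] > 0                                # drop points behind the camera
    cam, lab = cam[in_front], labels[in_front]
    u = np.floor(cam[:, 0] / cam[:, 2]).astype(int)         # pixel column
    v = np.floor(cam[:, 1] / cam[:, 2]).astype(int)         # pixel row
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    label_map = np.full((height, width), IGNORE, dtype=np.uint8)
    # No depth ordering is applied here; colliding points simply overwrite each other.
    label_map[v[inside], u[inside]] = lab[inside]
    return label_map

# Toy demo: a 6x8 image and four points, two of which are visible.
proj = np.array([[10.0, 0.0, 4.0, 0.0],
                 [0.0, 10.0, 3.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
points = np.array([[0.0, 0.0, 1.0],    # projects to (u=4, v=3)
                   [1.0, 0.0, 5.0],    # projects to (u=6, v=3)
                   [0.0, 0.0, -1.0],   # behind the camera
                   [10.0, 0.0, 1.0]])  # outside the image
labels = np.array([1, 2, 3, 4])
label_map = project_labels(points, labels, proj, height=6, width=8)
coverage = float(np.mean(label_map != IGNORE))  # fraction of pixels that received a label
```

Even in this toy case only 2 of 48 pixels receive a label, which is the kind of sparsity that a dense pseudo-labelling stage is meant to address.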