Three Pillars improving Vision Foundation Model Distillation for Lidar

Gilles Puy, Spyros Gidaris, Alexandre Boulch, Oriane Siméoni, Corentin Sautier, Patrick Pérez, Andrei Bursuc, Renaud Marlet

TL;DR

This work tackles the gap between distilled and fully supervised 3D LiDAR features by focusing on three pillars: scaling the 3D backbone, leveraging high-quality 2D pretrained backbones, and pretraining on diverse datasets. It introduces ScaLR, a scalable, hyperparameter-free distillation approach that aligns 3D point features with 2D pixel features via a cosine similarity loss, while loading only a single random camera per batch to simplify multi-dataset pretraining. Empirically, scaling backbones and combining diverse pretraining data yield substantial improvements, achieving up to $67.8\%$ mIoU in linear probing on nuScenes and reducing the gap to fully supervised baselines to under $10.9\%$ in several settings, with strong robustness to domain shifts and perturbations such as Robo3D corruptions ($mCE=87.4\%$, $mRR=83.8\%$ on average). The results also show that a single backbone pretrained on multiple datasets can match or surpass specialized backbones, and that multi-teacher distillation can further boost performance, offering a scalable path toward robust vision-to-LiDAR knowledge transfer in autonomous driving systems.
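
As a rough illustration of the alignment objective described above, the sketch below computes a cosine-similarity loss between paired 3D point features and 2D pixel features. It assumes the point-pixel pairing has already been established (row i of each array describes the same physical location) and uses the common form "one minus cosine similarity, averaged over pairs". The function names and this exact form are assumptions made for illustration; the paper's precise formulation may differ.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_distillation_loss(point_feats, pixel_feats):
    """Mean of (1 - cosine similarity) over paired features.

    point_feats: (N, D) student features for N lidar points (hypothetical layout).
    pixel_feats: (N, D) teacher features at the pixels those points project to.
    """
    p = l2_normalize(point_feats)
    q = l2_normalize(pixel_feats)
    cos = np.sum(p * q, axis=-1)  # per-pair cosine similarity in [-1, 1]
    return float(np.mean(1.0 - cos))


# Toy usage with random features standing in for real backbone outputs.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(128, 32))
student = teacher + 0.1 * rng.normal(size=(128, 32))
loss = cosine_distillation_loss(student, teacher)
```

Written this way, the loss has no temperature or weighting term to tune, which is in keeping with the hyperparameter-free description above.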

Abstract

Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. Ideally, 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results, obtained thanks to distillation methods that keep improving. Yet, we still notice a large performance gap when measuring the quality of distilled and fully supervised features by linear probing. In this work, instead of focusing only on the distillation method, we study the effect of three pillars for distillation: the 3D backbone, the pretrained 2D backbones, and the pretraining dataset. In particular, thanks to our scalable distillation method named ScaLR, we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. This allows us to significantly reduce the gap between the quality of distilled and fully-supervised 3D features, and to improve the robustness of the pretrained backbones to domain gaps and perturbations.

Paper Structure

This paper contains 23 sections, 1 equation, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: Correlation properties of distilled 3D features. Correlation maps for a point located on a car in four different scenes extracted from nuScenes [nuscenes], SemanticKITTI [behley2019semantickitti], Pandar64 and PandarGT [pandaset], respectively. The features used to compute these maps are extracted from a single backbone pretrained on all four datasets with ScaLR. Color goes from blue (low values) to red (high values).
  • Figure 2: Distilled feature visualizations. We project the features at the output of $\phi_{\rm 3D}$ into a three-dimensional space by PCA. The projected values serve as RGB values to color the point clouds, i.e., the first, second and third components are used as the red, green and blue channels, respectively. Note that the PCA is done independently for each scan, which explains why the colors are not consistent from one scan to another. In this figure, we used the WI-768 backbone pretrained on nuScenes, SemanticKITTI, Pandar64 and PandarGT with ScaLR.
  • Figure 3: Similarity map with class prototype. For each scan, we use the ground-truth labels (presented on even rows) of four classes (car, pedestrian, road, sidewalk) to compute a class prototype (mean feature of the points belonging to the considered class). We then compute the feature similarity map (presented on odd rows) with respect to that class prototype. Color goes from blue (low values) to red (high values).
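
The visualizations described in Figures 2 and 3 can be approximated with a few lines of NumPy. The sketch below is only illustrative: the function names are hypothetical, the min-max rescaling of PCA components to [0, 1] is an assumed choice for producing valid colors, and cosine similarity is assumed as the similarity measure (the captions only say "feature similarity").

```python
import numpy as np


def pca_to_rgb(feats):
    """Project (N, D) features of one scan onto their first 3 principal
    components and rescale each component to [0, 1] for use as R, G, B.
    Requires N >= 3 and D >= 3."""
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    lo = proj.min(axis=0, keepdims=True)
    hi = proj.max(axis=0, keepdims=True)
    return (proj - lo) / (hi - lo + 1e-8)  # assumed min-max color scaling


def prototype_similarity(feats, labels, class_id):
    """Cosine similarity of every point feature to the class prototype,
    i.e. the mean feature of the points labeled class_id."""
    prototype = feats[labels == class_id].mean(axis=0)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    return f @ p


# Toy scan: two classes whose features cluster around different axes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 25)
feats = 0.05 * rng.normal(size=(50, 16))
feats[labels == 0, 0] += 1.0
feats[labels == 1, 1] += 1.0

rgb = pca_to_rgb(feats)                        # (50, 3) colors, one per point
sim = prototype_similarity(feats, labels, 0)   # (50,) similarity map for class 0
```

Because the PCA is fitted separately on each scan, the same object class can receive different colors in different scans, as noted in the Figure 2 caption.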