CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani

TL;DR

CleverDistiller tackles cross-modal knowledge distillation from 2D vision foundation models to 3D LiDAR networks by using a simple, self-supervised framework. It replaces complex losses and pseudo-semantic maps with a direct feature similarity objective implemented via a three-layer MLP projection head, and complements this with an auxiliary occupancy task to inject spatial reasoning. The approach achieves state-of-the-art performance on 2D-to-3D KD for semantic segmentation and 3D object detection, particularly when fine-tuning with limited labeled data, and demonstrates strong domain generalization and robustness. The method is computationally efficient and versatile across backbones and datasets, underscoring the practicality of minimalistic cross-modal distillation strategies in autonomous driving contexts.

Abstract

Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses or pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices. Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi-layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge obtained from a VFM through KD with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD), with gains of up to 10% mIoU, especially when fine-tuning on very small amounts of labeled data, showing the effectiveness of our simple yet powerful KD strategy.
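The two training signals named in the abstract (a direct feature similarity loss through an MLP projection head, plus an auxiliary occupancy task) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The cosine form of the similarity loss, the binary cross-entropy form of the occupancy loss, the feature dimensions, the random weights, and the weight `lambda_occ` are all assumptions made for illustration, as are the function names.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Three linear layers with random placeholder weights (hypothetical sizes)."""
    dims = [d_in, d_hidden, d_hidden, d_out]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Apply the projection head: linear layers with ReLU between them."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def similarity_loss(student, teacher, eps=1e-8):
    """Assumed form: mean of (1 - cosine similarity) over paired features."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

def occupancy_loss(logits, occupied, eps=1e-8):
    """Assumed form: binary cross-entropy on per-cell occupancy labels."""
    p = 1.0 / (1.0 + np.exp(-logits))
    bce = occupied * np.log(p + eps) + (1.0 - occupied) * np.log(1.0 - p + eps)
    return float(-np.mean(bce))

rng = np.random.default_rng(0)
n_points = 32
point_feats = rng.normal(size=(n_points, 96))     # stand-in for 3D backbone output
teacher_feats = rng.normal(size=(n_points, 384))  # stand-in for paired 2D VFM features
occ_logits = rng.normal(size=n_points)            # stand-in for occupancy predictions
occ_labels = (rng.random(n_points) > 0.5).astype(float)

head = init_mlp(rng, d_in=96, d_hidden=128, d_out=384)
kd = similarity_loss(mlp_forward(head, point_feats), teacher_feats)
occ = occupancy_loss(occ_logits, occ_labels)
lambda_occ = 0.5  # hypothetical weighting, not taken from the paper
total = kd + lambda_occ * occ
```

In this sketch only the 3D features pass through the projection head, so the teacher features stay fixed targets and the head absorbs the dimensional mismatch between the two modalities.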

Paper Structure

This paper contains 31 sections, 1 equation, 6 figures, 13 tables.

Figures (6)

  • Figure 1: We observe how CleverDistiller (top) improves over the baseline ScaLR (bottom) by producing spatially consistent semantic outputs.
  • Figure 2: Overview of the CleverDistiller framework. Sensor calibration is used to associate 3D points with image regions. Features are extracted using modality-specific backbones. A cross-modal KD loss distills camera features into the 3D backbone via an MLP projection head, while an occupancy loss enforces spatial consistency. The pre-trained 3D backbone is then used for downstream tasks.
  • Figure 3: MinkUNet performance (1% fine-tuning) on the nuScenes dataset vs. the RankMe metric of the 3D backbone features. The colors refer to the distillation method; the image teachers are denoted as $\bullet$ for ViT-S/14 and $\times$ for ViT-B/14.
  • Figure A1: Qualitative segmentation results comparing our CleverDistiller (right) to ScaLR (middle) and the ground truth (left). The legend corresponds to the ground truth. For ScaLR and CleverDistiller we show prediction errors (in red), showing that our approach makes far fewer errors. Each row is a different sequence randomly sampled from nuScenes.
  • ...and 1 more figure
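The Figure 2 caption mentions that sensor calibration is used to associate 3D points with image regions. A minimal sketch of that association step follows, assuming a pinhole camera model, a 4x4 LiDAR-to-camera extrinsic matrix, and image regions that are square ViT patches of side 14 (matching the "/14" in the teacher names). The function names `project_points` and `gather_patch_features` are hypothetical, and the actual pipeline may differ.

```python
import numpy as np

def project_points(points_lidar, T_cam_from_lidar, K, image_size):
    """Project LiDAR points to pixels; return (u, v) and an in-image mask."""
    n = points_lidar.shape[0]
    homo = np.hstack([points_lidar, np.ones((n, 1))])   # (N, 4) homogeneous
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]          # points in camera frame
    in_front = cam[:, 2] > 1e-6                         # drop points behind camera
    z = np.where(in_front, cam[:, 2], 1.0)              # avoid division by zero
    uv_h = (K @ cam.T).T
    u = uv_h[:, 0] / z
    v = uv_h[:, 1] / z
    width, height = image_size
    valid = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1), valid

def gather_patch_features(feat_map, uv, valid, patch=14):
    """Look up the teacher feature of the patch each valid point falls into."""
    rows = np.clip((uv[valid, 1] // patch).astype(int), 0, feat_map.shape[0] - 1)
    cols = np.clip((uv[valid, 0] // patch).astype(int), 0, feat_map.shape[1] - 1)
    return feat_map[rows, cols]

# Toy calibration: identity extrinsics and a small 28x28 image (2x2 patches).
K = np.array([[10.0, 0.0, 14.0],
              [0.0, 10.0, 14.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
points = np.array([[0.0, 0.0, 1.0],     # projects to the image center
                   [0.0, 0.0, -1.0],    # behind the camera
                   [5.0, 0.0, 1.0],     # falls outside the image
                   [-1.0, -1.0, 1.0]])  # lands in the top-left patch
feat_map = np.arange(12, dtype=float).reshape(2, 2, 3)  # (rows, cols, channels)

uv, valid = project_points(points, T, K, image_size=(28, 28))
feats = gather_patch_features(feat_map, uv, valid)
```

Only points with `valid` set receive a teacher feature, so in this sketch a distillation loss would be computed on those pairs alone.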