CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

Hariprasath Govindarajan; Maciej K. Wozniak; Marvin Klingner; Camille Maurice; B Ravi Kiran; Senthil Yogamani

TL;DR

CleverDistiller tackles cross-modal knowledge distillation from 2D vision foundation models to 3D LiDAR networks by using a simple, self-supervised framework. It replaces complex losses and pseudo-semantic maps with a direct feature similarity objective implemented via a three-layer MLP projection head, and complements this with an auxiliary occupancy task to inject spatial reasoning. The approach achieves state-of-the-art performance on 2D-to-3D KD for semantic segmentation and 3D object detection, particularly when fine-tuning with limited labeled data, and demonstrates strong domain generalization and robustness. The method is computationally efficient and versatile across backbones and datasets, underscoring the practicality of minimalistic cross-modal distillation strategies in autonomous driving contexts.

Abstract

Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they rely on highly complex distillation losses or pseudo-semantic maps, or limit KD to features useful only for semantic segmentation. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices. Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multilayer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies through the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD), with gains of up to 10% mIoU, especially when fine-tuning on very small amounts of labeled data, showing the effectiveness of our simple yet powerful KD strategy.
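
The combination of a direct feature similarity loss and an MLP projection head described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine form of the similarity, the ReLU activations, the feature dimensions (16, 32, 8), the random initialization, and every function name below are assumptions made for illustration. It also assumes that each LiDAR point has already been paired with the VFM feature of the pixel it projects to.

```python
import math
import random


def linear(x, w, b):
    """Apply y = W x + b to a single feature vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Random weights and zero bias (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def mlp_head(x, layers):
    """Three-layer projection head: linear, ReLU, linear, ReLU, linear.
    The ReLU activation is an assumption, not taken from the source."""
    for i, (w, b) in enumerate(layers):
        x = linear(x, w, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (max(norm_a, eps) * max(norm_b, eps))


def similarity_loss(point_feats, pixel_feats, layers):
    """Mean of (1 - cosine similarity) between projected 3D point features
    and their paired 2D teacher features (assumed loss form)."""
    losses = [
        1.0 - cosine_similarity(mlp_head(p, layers), t)
        for p, t in zip(point_feats, pixel_feats)
    ]
    return sum(losses) / len(losses)


rng = random.Random(0)
# Hypothetical sizes: 16-dim student features, 8-dim teacher features.
layers = [init_layer(16, 32, rng), init_layer(32, 32, rng), init_layer(32, 8, rng)]
point_feats = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
pixel_feats = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
loss = similarity_loss(point_feats, pixel_feats, layers)
print(f"distillation loss: {loss:.4f}")
```

In this sketch the teacher features are treated as fixed targets and only the student side passes through the projection head, so the head absorbs the mismatch between the 3D feature space and the 2D one. A real training setup would compute gradients with an autodiff framework rather than plain Python lists.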
