LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

Wonjun Jo, Hyunwoo Ha, Kim Ji-Yeon, Hawook Jeong, Tae-Hyun Oh

TL;DR

This work tackles the fragility of self-supervised 2D image encoders in adverse weather by introducing a LiDAR-anchored Collaborative Distillation framework. It employs a two-stage, cross-modal approach: Stage 1 pre-aligns LiDAR features to the clear-day 2D feature space, and Stage 2 uses these aligned 3D features as 3D-anchored supervision to denoise and regularize degraded 2D representations. The method improves in-domain and out-of-domain semantic segmentation and depth estimation, while also enhancing 3D awareness, with strong generalization across outdoor and indoor datasets. Practically, this yields more robust perception pipelines for vision-based systems operating under real-world, degraded conditions.

Abstract

As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short under noisy and adverse weather conditions beyond clear daytime scenes, where robust visual perception is required. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which leverages 3D LiDAR as self-supervision to improve the robustness of 2D image encoders to noisy and adverse weather conditions while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. Our method also improves 3D awareness, stemming from LiDAR's characteristics. This advancement highlights our method's practicality and adaptability in real-world scenarios.

Paper Structure

This paper contains 59 sections, 2 equations, 8 figures, 20 tables.

Figures (8)

  • Figure 1: Collaborative Distillation. Under adverse weather conditions, the 2D feature distribution degrades (red) while the 3D feature distribution remains stable (green). Stage 1 aligns the 3D feature distribution to the 2D clear-side (blue). Stage 2 uses the aligned 3D features to pull the 2D degraded-side toward the 2D clear-side. This yields robust 2D features with original semantic context.
  • Figure 2: Overall Pipeline of the proposed method. (a) Stage 1 (Pre-alignment) aligns the 3D features to the clear-side 2D features, and Stage 2 (3D-anchored self-supervision) pulls degraded 2D features under adverse conditions toward the pre-aligned 3D features. (b) The bi-directional distillation module matches pixel- and point-wise features and applies cross-modal distillation loss.
  • Figure 3: t-SNE visualization of extracted image features. Compared with DINOv2 and DINOv2+, where clear- and adverse-side clusters remain separated, our method shifts the adverse-side features toward the clear-side cluster, achieving the intended distribution shift.
  • Figure 4: Feature Visualization. Compared with DINOv2 and DINOv2+, our method produces cleaner features across all conditions, indicating improved robustness and a feature-denoising effect.
  • Figure 5: Qualitative results of out-of-domain depth estimation. Compared with DINOv2 and Condense [zhang2024condense], our method yields clearer and less noisy depth maps in boxed areas, exhibiting robustness and strong generalization despite being pre-trained only on outdoor data.
  • ...and 3 more figures
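
The pixel- and point-wise matching and the cross-modal distillation loss described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole projection, the nearest-neighbor feature lookup, and the cosine-distance loss are choices made here for illustration (the captions do not specify the loss form), and the names `project_points`, `gather_pixel_features`, and `cosine_distill_loss` are hypothetical.

```python
import numpy as np


def project_points(points_cam, K):
    """Project 3D points (N, 3), given in the camera frame, to pixels (N, 2).

    Assumes a simple pinhole camera with intrinsics K (3, 3).
    Also returns a mask of points that lie in front of the camera.
    """
    z = points_cam[:, 2]
    in_front = z > 1e-6
    uvw = points_cam @ K.T  # row-wise K @ p
    # Avoid dividing by non-positive depth; such points are masked out anyway.
    safe_z = np.where(in_front, z, 1.0)
    uv = uvw[:, :2] / safe_z[:, None]
    return uv, in_front


def gather_pixel_features(feat_map, uv, in_front):
    """Look up the image feature at each projected point (nearest pixel).

    feat_map has shape (H, W, C). Returns the matched features (M, C)
    and a boolean mask (N,) marking which points landed inside the image.
    """
    H, W, _ = feat_map.shape
    cols = np.rint(uv[:, 0]).astype(int)
    rows = np.rint(uv[:, 1]).astype(int)
    valid = in_front & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
    return feat_map[rows[valid], cols[valid]], valid


def cosine_distill_loss(student, target, eps=1e-8):
    """Mean cosine distance between matched feature pairs (M, C).

    ASSUMPTION: cosine distance stands in for the paper's distillation loss.
    In a real training loop the target side would be detached from the graph.
    """
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

Following the captions, Stage 1 would call `cosine_distill_loss` with the point features as `student` and the clear-side image features as `target`. Stage 2 would swap the roles, using degraded image features as `student` and the pre-aligned point features (restricted to the `valid` mask) as `target`.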