Image-to-Lidar Relational Distillation for Autonomous Driving Data

Anas Mahmoud, Ali Harakeh, Steven Waslander

TL;DR

This work investigates the structural gap between the 2D and 3D representations produced by state-of-the-art distillation frameworks and reveals a significant mismatch. It proposes a relational distillation framework that enforces intra-modal and cross-modal constraints, yielding distilled 3D representations that closely capture the structure of the 2D representation.

Abstract

Pre-trained on extensive and diverse multi-modal datasets, 2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations. The emergence of 2D-to-3D distillation frameworks has extended these capabilities to 3D models. However, distilling 3D representations for autonomous driving datasets presents challenges like self-similarity, class imbalance, and point cloud sparsity, hindering the effectiveness of contrastive distillation, especially in zero-shot learning contexts. Whereas other methodologies, such as similarity-based distillation, enhance zero-shot performance, they tend to yield less discriminative representations, diminishing few-shot performance. We investigate the gap in structure between the 2D and the 3D representations that result from state-of-the-art distillation frameworks and reveal a significant mismatch between the two. Additionally, we demonstrate that the observed structural gap is negatively correlated with the efficacy of the distilled representations on zero-shot and few-shot 3D semantic segmentation. To bridge this gap, we propose a relational distillation framework enforcing intra-modal and cross-modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation. This alignment significantly enhances 3D representation performance over those learned through contrastive distillation in zero-shot segmentation tasks. Furthermore, our relational loss consistently improves the quality of 3D representations in both in-distribution and out-of-distribution few-shot segmentation tasks, outperforming approaches that rely on the similarity loss.
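
To make the two baseline objectives named in the abstract concrete, the following is a minimal NumPy sketch of a contrastive (InfoNCE-style) loss and a similarity (cosine) loss over paired 3D and 2D features. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the row-wise pairing of point and pixel features are hypothetical choices. The proposed relational loss is not sketched here because the abstract does not define it.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def contrastive_loss(feat_3d, feat_2d, temperature=0.07):
    """InfoNCE-style loss: row i of feat_3d is the positive for row i of
    feat_2d, and every other row of feat_2d acts as a negative.
    The temperature value is an assumption for illustration."""
    z3, z2 = l2_normalize(feat_3d), l2_normalize(feat_2d)
    logits = z3 @ z2.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def similarity_loss(feat_3d, feat_2d):
    """Mean (1 - cosine similarity) over paired rows; uses no negatives."""
    z3, z2 = l2_normalize(feat_3d), l2_normalize(feat_2d)
    return float(np.mean(1.0 - np.sum(z3 * z2, axis=1)))
```

The contrastive loss pushes every non-matching pair apart, including pairs that share a semantic class, whereas the similarity loss only pulls matched pairs together. This difference is one way to read the abstract's observation that the two objectives trade off zero-shot and few-shot performance.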

Paper Structure (24 sections, 8 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: We distill 2D representations from CLIP [radford2021learningclip] to a 3D point-cloud encoder using the contrastive loss, the similarity loss, and our proposed relational loss, and compute the uniformity (U), tolerance (T), and modality gap (G) of the learned 3D representations. We sample $5000$ point features from each of the $16$ classes defined in the nuScenes dataset [caesar2020nuscenes], apply PCA, and visualize the primary components. The source U and T of the CLIP image encoder are $1.54$ and $0.73$, respectively. Relative to the source, the contrastive loss learns 3D representations with higher U and lower T, while the trends are reversed for the similarity loss. Our proposed relational loss minimizes this structural mismatch and leads to the lowest modality gap.
  • Figure 2: Blue: The source representation space, with a uniformity and tolerance of $U=0.89$ and $T=0.66$, respectively. Cyan: The predicted representation space. Here, $t$ denotes the number of training iterations.
  • Figure 3: Left: Input points to the MLP model in both the 3-cluster and 1-cluster setups. Middle: The 1-cluster setup of the toy example, with the output of the randomly initialized MLP in cyan and the target cluster in blue. Right: The 3-cluster setup, with the outputs of the randomly initialized MLP in cyan, light green, and pink and the target clusters in blue, green, and red.
  • Figure 4: Zero-shot segmentation performance of relational loss compared to state-of-the-art methods.
  • Figure 4: The uniformity values achieved by each loss, compared to the uniformity of the source 3D space, as training progresses on the 3-cluster setup.
  • ...and 3 more figures
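
The uniformity (U) and tolerance (T) statistics quoted in the captions above can be illustrated with a minimal NumPy sketch. The definitions used here are assumptions, since the captions do not spell them out: U is taken as the negative log of the mean Gaussian potential over all distinct pairs of unit-normalized features (with a hypothetical kernel scale `t=2.0`, so that larger values mean a more evenly spread representation), and T is taken as the mean cosine similarity over distinct pairs that share a class label. The paper's exact formulas and sign conventions may differ.

```python
import numpy as np


def _unit_rows(x, eps=1e-12):
    """Scale each feature vector to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def uniformity(features, t=2.0):
    """Assumed definition: -log mean exp(-t * ||z_i - z_j||^2) over i < j.
    Collapsed features give 0; more spread-out features give larger values."""
    z = _unit_rows(np.asarray(features, dtype=float))
    # For unit vectors, squared distance = 2 - 2 * cosine similarity.
    sq_dist = np.clip(2.0 - 2.0 * (z @ z.T), 0.0, None)
    upper = np.triu_indices(len(z), k=1)  # each distinct pair once
    return float(-np.log(np.mean(np.exp(-t * sq_dist[upper]))))


def tolerance(features, labels):
    """Assumed definition: mean cosine similarity over distinct pairs that
    share a class label. Requires at least one same-class pair."""
    z = _unit_rows(np.asarray(features, dtype=float))
    labels = np.asarray(labels)
    sim = z @ z.T
    same_class = labels[:, None] == labels[None, :]
    upper = np.triu_indices(len(z), k=1)
    return float(np.mean(sim[upper][same_class[upper]]))
```

Under these assumed definitions, a representation that collapses every class to a single point has perfect tolerance but low uniformity, while a representation that scatters all points evenly has high uniformity but low tolerance. Comparing both numbers against the source (2D) values is how a structural mismatch of the kind described in Figure 1 would show up.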