Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining

Tianfang Sun, Zhizhong Zhang, Xin Tan, Yanyun Qu, Yuan Xie

TL;DR

The paper tackles two problems in LiDAR–camera pretraining for 3D semantic segmentation: the underutilization of sweep data and cross-frame self-conflict. It introduces a VFM-driven sample exploring module that harvests synchronized, content-diverse LiDAR–image pairs from sweeps, and a cross- and intra-modal conflict-aware contrastive loss that uses Vision Foundation Model (VFM) masks to avoid incorrect negatives. Empirically, it achieves state-of-the-art finetuning results on nuScenes, SemanticKITTI, Waymo, and Synth4D, generalizes across 3D backbones and mask sources, and yields more semantically consistent embeddings. These techniques improve representation learning for autonomous driving perception and show practical potential for broader VFM-based 3D pretraining.

Abstract

LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications. However, two issues are widespread in this framework: 1) Only keyframes are used for training. For example, in nuScenes, a substantial quantity of unpaired LiDAR and camera frames remains unutilized, limiting the representation capabilities of the pretrained network. 2) The contrastive loss erroneously pushes apart points and image regions that have identical semantics but come from different frames, disturbing the semantic consistency of the learned representations. In this paper, we propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames, enriching the original training set. We use timestamps and the semantic priors from VFMs to identify well-synchronized training pairs and to discover samples with diverse content. Moreover, we design a cross- and intra-modal conflict-aware contrastive loss that uses the semantic mask labels of VFMs to avoid contrasting semantically similar points and image regions. Our method consistently outperforms existing state-of-the-art pretraining frameworks on 3D semantic segmentation across three major public autonomous driving datasets, nuScenes, SemanticKITTI, and Waymo, by +3.0%, +3.0%, and +3.3% in mIoU, respectively. Furthermore, our approach generalizes to different 3D backbones and to typical semantic masks generated by non-VFM models.
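
The conflict-aware idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes each row of `points` and `pixels` is an already pooled region feature, that row i of both arrays forms a positive pair, and that `labels` holds one VFM semantic label per pair. The function name, the temperature value, and the InfoNCE form are all illustrative assumptions. The only point it shows is that off-diagonal pairs sharing a semantic label are dropped from the negatives instead of being pushed apart.

```python
import numpy as np


def conflict_aware_info_nce(points, pixels, labels, temperature=0.07):
    """Cross-modal InfoNCE that skips negatives sharing the positive's label.

    points, pixels: (N, D) arrays; row i of each forms a positive pair.
    labels: (N,) integer semantic labels (assumed to come from VFM masks).
    All names and defaults here are hypothetical, chosen for illustration.
    """
    # L2-normalise so the dot product is a cosine similarity.
    p = points / np.linalg.norm(points, axis=1, keepdims=True)
    q = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
    logits = p @ q.T / temperature  # (N, N) similarity matrix

    # A "conflict" is an off-diagonal pair with the same semantic label.
    same_label = labels[:, None] == labels[None, :]
    conflict = same_label & ~np.eye(len(labels), dtype=bool)

    # Masking with -inf removes conflicting pairs from the softmax denominator.
    logits = np.where(conflict, -np.inf, logits)

    # Numerically stable row-wise log-softmax; the diagonal is the positive.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

With all labels distinct this reduces to a plain InfoNCE loss. When every pair shares one label, no negatives remain and the loss is zero, which is why a real pipeline would need batches with several semantic classes.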

Paper Structure (11 sections, 4 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparison of few-shot finetuning performance on three public autonomous driving benchmarks [caesar2020nuscenes, behley2019semantickitti, sun2020scalability] with two popular backbones [choy20194d, tang2020searching]. The percentage after each dataset name denotes the fraction of annotations used for finetuning. Note that all our models are pretrained on nuScenes [caesar2020nuscenes]. Our method significantly outperforms other recent cross-modal contrastive learning methods [sautier2022image, mahmoud2023self, liu2023segment] on all three datasets under all annotation settings.
  • Figure 2: The overall framework. We sample both temporally synchronized and distinct LiDAR-Image pairs from the untapped sweeps set with our VFM-driven sample exploring module (\ref{sec:vse}). The LiDAR-Image pairs are embedded into a unified feature space by the corresponding backbones and grouped by the VFM masks. The pretraining objective is conflict-aware contrastive learning (\ref{sec:ccl}), which includes the cross-modal conflict-aware contrastive loss and the intra-modal conflict-aware contrastive loss.
  • Figure 3: The VFM-driven sample exploring module. The top part of the figure is divided into three panels. Left: the overall pipeline of the proposed module. Middle: mIoU score calculation between sweeps and keyframes. Right: statistics of the sweeps selected by our module. $\sigma(\Delta T)$ and $\delta(\Delta T)$ denote the mean and standard deviation of the timestamp difference, and $count$ is the total number of samples in the corresponding set. The bottom part shows two examples of images from the selected sweeps (i.e., the images with a red border).
  • Figure 4: Proxy experiments. From left to right, the first and third panels depict the cross-frame "self-conflict" issue in the cross-modal and intra-modal domains, respectively. The second and fourth panels report the proxy experiment results, i.e., the 1% finetuning results on the nuScenes dataset after pretraining under different settings.
  • Figure 5: The intra- and cross-modal cosine similarity between a query point (marked by the orange circle) and the features learned with our method and with SLidR* [sautier2022image]. The color ranges from blue (low similarity) to red (high similarity). The orange arrow indicates where the objects in the images lie in the corresponding point cloud. In the top row, we calculate the cosine similarity between the 3D feature of the query point and the 2D features of all pixels in the paired input sample. In the bottom row, we calculate the cosine similarity between the 3D feature of the query point and the features of the entire point cloud.
  • ...and 1 more figure
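
The sample exploring step described for Figure 3 combines two signals: the timestamp difference between a camera sweep and a LiDAR sweep, and an mIoU score between the semantic masks of a sweep image and a keyframe image. The sketch below shows one way these could fit together. It is a guess at the general shape, not the paper's algorithm: the thresholds `max_dt` and `max_miou`, the nearest-timestamp pairing, the comparison against a single keyframe mask, and the rule that a low mIoU signals diverse content are all assumptions made for illustration.

```python
import numpy as np


def mask_miou(mask_a, mask_b, num_classes):
    """Mean IoU between two integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        a, b = mask_a == c, mask_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class absent from both masks, so it is not scored
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 1.0


def select_sweeps(cam_ts, lidar_ts, sweep_masks, key_mask, num_classes,
                  max_dt=0.02, max_miou=0.5):
    """Return (camera_index, lidar_index) pairs that pass both filters.

    max_dt (seconds) and max_miou are hypothetical thresholds.
    """
    lidar_ts = np.asarray(lidar_ts, dtype=float)
    selected = []
    for i, (t, mask) in enumerate(zip(cam_ts, sweep_masks)):
        # Synchronisation: pair the image with the closest LiDAR sweep in time.
        j = int(np.argmin(np.abs(lidar_ts - t)))
        if abs(lidar_ts[j] - t) > max_dt:
            continue
        # Diversity (assumed rule): skip sweeps whose VFM mask is too close
        # to the keyframe mask, since they would add little new content.
        if mask_miou(mask, key_mask, num_classes) > max_miou:
            continue
        selected.append((i, j))
    return selected
```

A call such as `select_sweeps(cam_ts, lidar_ts, sweep_masks, key_mask, num_classes=2)` returns index pairs into the two sweep lists. Summarising the timestamp differences of the returned pairs would give statistics of the kind reported in the right panel of Figure 3.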