S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving

Maciej K. Wozniak, Hariprasath Govindarajan, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani

TL;DR

This work tackles self-supervised pre-training for autonomous driving under domain shifts and long-tailed object distributions. It introduces S3PT, a scene semantics and structure guided clustering framework that combines three components: semantic distribution consistent clustering based on a von Mises-Fisher (vMF) normalized loss, object diversity consistent spatial clustering that handles imbalanced and diverse object sizes, and depth-guided spatial clustering that integrates LiDAR depth into the clustering cost. The method uses a teacher-student framework and exploits depth information without requiring a LiDAR encoder. It improves 2D semantic segmentation and 3D object detection on nuScenes, nuImages, and Cityscapes, and generalizes well across domains. The gains over CrIBo are largest for rare and small objects, and the results suggest that increasing data diversity can unlock further improvements for self-supervised learning (SSL) in autonomous driving.

Abstract

Recent self-supervised clustering-based pre-training techniques like DINO and CrIBo have shown impressive results for downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering to provide more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.

Paper Structure

This paper contains 31 sections, 2 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Visualization of our main contribution using scene semantics and structure-guided clustering (S3PT) in self-supervised pre-training. The baseline CrIBo (top) provides inconsistent clusters on autonomous driving data, failing to capture geometric information and struggling with imbalanced object sizes and classes due to its strong encouragement of uniform distributions. In contrast, S3PT (bottom) offers scene-consistent clustering (e.g., correctly differentiating the two cars) thanks to our contributions.
  • Figure 2: Overview of S3PT. For each image, two different views are extracted and fed into the teacher and student networks, which share the same architecture (we use a ViT, but any network can be used). Based on teacher network features and depth information from LiDAR, an object-diversity consistent joint-view clustering is performed to extract object-level features. Finally, the student is trained using a vMF-normalized loss formulation, which enables flexible and non-uniform semantic distributions.
  • Figure 3: Average number of pixels per class in the nuImages dataset.
  • Figure 4: A depth-guided clustering is proposed that uses the depth maps to modify the cost matrix. The spatial clustering is modified to use a larger number of clusters and to relax the uniformity assumption in the Sinkhorn-Knopp algorithm, enabling the identification of objects of diverse sizes in a scene.
  • Figure 5: Object-wise segmentation performance of the models from the ablation results table, using the Mask Transformer head. The models are obtained by sequentially applying the proposed modifications to the CrIBo baseline to finally arrive at S3PT. The proposed modifications improve performance on all classes; the improvement is particularly pronounced on underrepresented classes such as bus and construction vehicle, and on small objects such as bicycles, motorcycles, pedestrians, and traffic cones.
  • ...and 5 more figures
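
The idea in the Figure 4 caption (a depth term added to the clustering cost matrix, and a Sinkhorn-Knopp step whose cluster-size constraint is no longer uniform) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the additive depth penalty weighted by `lam`, the normalization of the depth gap, and the use of a non-uniform column marginal as the way to relax uniformity are all assumptions made for illustration.

```python
import numpy as np


def depth_guided_cost(features, centroids, depth, centroid_depth, lam=1.0):
    """Hypothetical cost: cosine distance plus a weighted, normalized depth gap."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    feat_cost = 1.0 - f @ c.T  # (N, K) cosine distance
    depth_gap = np.abs(depth[:, None] - centroid_depth[None, :])
    depth_gap = depth_gap / (depth_gap.max() + 1e-8)  # scale to [0, 1]
    return feat_cost + lam * depth_gap


def sinkhorn(cost, col_marginal=None, eps=0.1, n_iters=200):
    """Entropic optimal transport via Sinkhorn-Knopp iterations.

    col_marginal=None enforces equally sized clusters (uniform assumption).
    Passing a non-uniform vector is one assumed way to relax that constraint.
    """
    n, k = cost.shape
    r = np.full(n, 1.0 / n)  # every patch carries the same mass
    if col_marginal is None:
        c = np.full(k, 1.0 / k)
    else:
        c = np.asarray(col_marginal, dtype=float)
        c = c / c.sum()
    kernel = np.exp(-cost / eps)
    v = np.ones(k)
    for _ in range(n_iters):
        u = r / (kernel @ v)      # match row marginals
        v = c / (kernel.T @ u)    # match column marginals
    u = r / (kernel @ v)          # final row scaling so rows sum to r
    return u[:, None] * kernel * v[None, :]


# Toy scene: six patches with near-identical appearance (e.g. two similar cars)
# that differ only in depth. Both centroids share the same appearance.
features = np.array([[1.0, 0.1], [1.0, -0.1], [1.0, 0.0],
                     [1.0, 0.1], [1.0, -0.1], [1.0, 0.0]])
centroids = np.array([[1.0, 0.0], [1.0, 0.0]])
depth = np.array([5.0, 5.0, 5.0, 30.0, 30.0, 30.0])
centroid_depth = np.array([5.0, 30.0])

plan_depth = sinkhorn(depth_guided_cost(features, centroids, depth, centroid_depth))
plan_nodepth = sinkhorn(depth_guided_cost(features, centroids, depth, centroid_depth, lam=0.0))
plan_relaxed = sinkhorn(depth_guided_cost(features, centroids, depth, centroid_depth, lam=0.0),
                        col_marginal=[0.8, 0.2])

print(plan_depth.argmax(axis=1))   # hard assignment with the depth term
print(plan_relaxed.sum(axis=0))    # cluster masses under the relaxed marginal
```

In this toy setup the appearance cost alone cannot separate the two groups of patches, so the transport plan without the depth term splits every patch evenly between the two clusters, while the depth term assigns the near and far patches to different clusters. The non-uniform column marginal lets one cluster absorb most of the mass, which is the kind of flexibility needed when a scene mixes large background regions with small objects.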