S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving
Maciej K. Wozniak, Hariprasath Govindarajan, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani
TL;DR
This work tackles self-supervised pre-training for autonomous driving under domain shifts and long-tailed object distributions. It introduces S3PT, a scene semantics and structure guided clustering framework that integrates three modules: semantic distribution consistent clustering using a von Mises-Fisher prior, object diversity consistent spatial clustering to handle imbalanced object sizes, and depth-guided spatial clustering that integrates LiDAR depth into the clustering cost. The method uses a teacher-student framework and exploits depth information without requiring a LiDAR encoder. It improves 2D semantic segmentation and 3D object detection on nuScenes, nuImages, and Cityscapes, and generalizes well across domains. Results show substantial gains over CrIBo, especially for rare and small objects, and suggest that increasing data diversity can unlock further improvements for self-supervised learning in autonomous driving.
Abstract
Recent self-supervised clustering-based pre-training techniques like DINO and CrIBo have shown impressive results for downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering method that provides more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold. First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose depth-guided spatial clustering to regularize learning based on the geometric information of the scene, thus further refining region separation at the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.
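To make the third contribution more concrete, the following is a minimal sketch of how a depth term could enter a patch-to-cluster assignment cost. It is an illustration under stated assumptions, not the formulation used in S3PT: the additive combination of cosine distance and absolute depth difference, the `depth_weight` parameter, and the function names `depth_guided_cost` and `assign_patches` are all hypothetical.

```python
import numpy as np


def depth_guided_cost(features, depths, centroids, centroid_depths, depth_weight=0.5):
    """Cost matrix between N patches and K cluster centroids.

    Hypothetical formulation (an assumption, not taken from the paper):
    cost = cosine distance in feature space + depth_weight * |depth difference|.

    features:        (N, D) patch embeddings
    depths:          (N,)   per-patch depth, e.g. from projected LiDAR points
    centroids:       (K, D) cluster centroids in feature space
    centroid_depths: (K,)   representative depth of each cluster
    """
    # L2-normalise so that the dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    feature_cost = 1.0 - f @ c.T  # shape (N, K)

    # Broadcast to an (N, K) matrix of absolute depth differences.
    depth_cost = np.abs(depths[:, None] - centroid_depths[None, :])

    return feature_cost + depth_weight * depth_cost


def assign_patches(cost):
    """Assign each patch to its lowest-cost cluster (hard assignment)."""
    return cost.argmin(axis=1)
```

In this sketch, two patches with identical appearance features but very different depths are indistinguishable when `depth_weight` is 0 and are separated into different clusters once the depth term is active, which is the kind of geometry-based region separation the abstract describes.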
