PatchContrast: Self-Supervised Pre-training for 3D Object Detection
Oren Shrout, Ori Nizan, Yizhak Ben-Shabat, Ayellet Tal
TL;DR
PatchContrast addresses label scarcity in 3D object detection with a self-supervised pre-training framework that learns at two intermediate abstraction levels: proposals (for localization) and patches (for object components). Using two augmented views, a BEV projection, and masked-attention-based patch refinement, PatchContrast aligns proposal- and patch-level representations across views while enforcing region-level discrimination through two contrastive losses and a reconstruction objective. The approach yields state-of-the-art or competitive results on Waymo, KITTI, and ONCE, with pronounced advantages in data-scarce regimes and favorable transfer to out-of-domain datasets. Overall, PatchContrast shows that multi-level self-supervised learning (SSL) can significantly reduce labeling requirements while still delivering robust 3D detectors in autonomous driving settings.
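To make the idea of contrasting two augmented views concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE-style loss applied once to proposal embeddings and once to patch embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names `info_nce` and `two_level_contrastive`, the temperature value, the equal weighting of the two terms, and the premise that row i in each view describes the same region are all hypothetical. The BEV projection, masked-attention refinement, and reconstruction objective are omitted.

```python
import numpy as np


def info_nce(view_a, view_b, temperature=0.1):
    """Symmetric InfoNCE loss between two (N, D) embedding arrays.

    Assumption: row i of view_a and row i of view_b embed the same region
    (a positive pair); every other row in the batch acts as a negative.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def two_level_contrastive(prop_a, prop_b, patch_a, patch_b, temperature=0.1):
    """Sum of a proposal-level and a patch-level contrastive term.

    Hypothetical equal weighting; a real system would likely tune this.
    """
    return info_nce(prop_a, prop_b, temperature) + info_nce(patch_a, patch_b, temperature)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    proposals = rng.normal(size=(8, 16))   # 8 proposals, 16-dim embeddings
    patches = rng.normal(size=(32, 16))    # 32 patches, 16-dim embeddings
    # Simulate a second augmented view as a slightly perturbed copy.
    loss = two_level_contrastive(
        proposals, proposals + 0.01 * rng.normal(size=proposals.shape),
        patches, patches + 0.01 * rng.normal(size=patches.shape),
    )
    print(f"two-level contrastive loss: {loss:.4f}")
```

In this sketch the loss is small when matching regions in the two views have similar embeddings and grows when the pairing is broken, which is the behavior a cross-view alignment objective relies on.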
Abstract
Accurately detecting objects in the environment is a key challenge for autonomous vehicles. However, obtaining annotated data for detection is expensive and time-consuming. We introduce PatchContrast, a novel self-supervised point cloud pre-training framework for 3D object detection. We propose to utilize two levels of abstraction to learn discriminative representations from unlabeled data: the proposal level and the patch level. The proposal level aims at localizing objects in relation to their surroundings, whereas the patch level adds information about the internal connections between an object's components, thereby distinguishing between different objects based on their individual components. We demonstrate how these levels can be integrated into self-supervised pre-training for various backbones to enhance the downstream 3D detection task. We show that our method outperforms existing state-of-the-art models on three commonly used 3D detection datasets.
