PatchContrast: Self-Supervised Pre-training for 3D Object Detection

Oren Shrout; Ori Nizan; Yizhak Ben-Shabat; Ayellet Tal

TL;DR

PatchContrast addresses label scarcity in 3D object detection with a self-supervised pre-training framework that learns at two intermediate levels of abstraction: proposals (for localization) and patches (for object components). Using two augmented views, a BEV projection, and masked-attention-based patch refinement, it aligns proposal-level and patch-level representations while enforcing region-level discrimination through two contrastive losses and a reconstruction objective. The approach yields state-of-the-art or competitive results on Waymo, KITTI, and ONCE, with pronounced advantages in data-scarce regimes and favorable transfer to out-of-domain datasets. Overall, PatchContrast shows that multi-level self-supervised learning (SSL) can substantially reduce labeling requirements while still producing robust 3D detectors for autonomous driving.

Abstract

Accurately detecting objects in the environment is a key challenge for autonomous vehicles. However, obtaining annotated data for detection is expensive and time-consuming. We introduce PatchContrast, a novel self-supervised point cloud pre-training framework for 3D object detection. We propose to use two levels of abstraction to learn discriminative representations from unlabeled data: the proposal level and the patch level. The proposal level aims at localizing objects in relation to their surroundings, whereas the patch level adds information about the internal connections between an object's components, making it possible to distinguish between different objects based on their individual components. We demonstrate how these levels can be integrated into self-supervised pre-training for various backbones to enhance the downstream 3D detection task. We show that our method outperforms existing state-of-the-art models on three commonly used 3D detection datasets.
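
To make the idea of contrasting matching regions across two augmented views concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is an illustration under stated assumptions, not the authors' exact objective: the function names, the temperature value, and the toy embeddings are all hypothetical, and the text above does not specify the form of the paper's losses. The sketch only assumes that embedding i in the first view matches embedding i in the second view, and that all other embeddings act as negatives.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce(view1, view2, temperature=0.1):
    """Mean InfoNCE loss; view1[i] and view2[i] are assumed to be a matching pair.

    Every other embedding in view2 serves as a negative for view1[i].
    The temperature is a hypothetical default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in view1]
    b = [l2_normalize(v) for v in view2]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matching candidate.
        total += log_denominator - logits[i]
    return total / len(a)

# Toy proposal embeddings (hypothetical values): 3 proposals, 2 dimensions.
p1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
p2_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
p2_shuffled = [p2_aligned[1], p2_aligned[2], p2_aligned[0]]

loss_aligned = info_nce(p1, p2_aligned)
loss_shuffled = info_nce(p1, p2_shuffled)
```

In this toy setup the loss is lower when the two views are correctly paired (`loss_aligned`) than when the pairing is scrambled (`loss_shuffled`), which is the behavior a contrastive objective relies on. The same function could be applied to proposal-to-patch pairs, again as an assumption about how the two levels might be scored.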

Paper Structure (21 sections, 5 equations, 16 figures, 10 tables)

Introduction
Related Work
PatchContrast framework
Proposal feature extraction
Patch feature extraction module
Region discrimination
Experiments
Implementation Details
In-domain 3D detection results (on Waymo)
Transfer learning for 3D detection (out of domain)
Additional evaluation
Ablation studies
Conclusion
Supplemental Materials
Qualitative results
...and 6 more sections

Figures (16)

Figure 1: PatchContrast overview. Contrastive learning frameworks for object detection focus on different levels of abstraction. (a) DepthContrast learns by contrasting a scene-level (global) representation. (b) ProposalContrast learns by contrasting a proposal-level (object) representation. (c) Our PatchContrast learns by contrasting two levels, proposals and patches, which inform localization and classification, respectively.
Figure 2: PatchContrast framework. (a) Given a point cloud $S_0$ with two augmented views $S_1,S_2$, we first extract proposals from each view. Scene view features $F_1,F_2$ are then extracted using a backbone and mapped onto the proposals $S^1_{pr},S^2_{pr}$. To get proposal-level features $P_1,P_2$, we feed them into a proposal encoder. (b) Simultaneously, we feed the proposals into the patch feature extraction module, where patches from each proposal are extracted and encoded to get patch-level features $\tilde{P}_1,\tilde{P}_2$ (see Fig. 3). (c) Finally, the region discrimination module enforces similarity between matching proposals from the two views and between a proposal and its composing patches.
Figure 3: Patch feature extraction module. (a) Given a proposal, we extract and encode patches for patch-level representation. (b) Then, we refine the patches' representations using a masked attention auxiliary task. This is done by masking one patch embedding and reconstructing it by leveraging information from its neighbors' embeddings.
Figure 4: Data efficiency on Waymo with frozen features. When using our learned features and training the detection head on only $10\%$ of the data, our approach already outperforms the SoTA trained on $100\%$.
Figure 5: Qualitative evaluation. Clustering the embeddings reveals semantically meaningful object components, such as cars and signs, discovered without any supervision.
...and 11 more figures
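
The masked-attention auxiliary task described in the Figure 3 caption (mask one patch embedding, then reconstruct it from its neighbors' embeddings) can be sketched as follows. This is a deliberately simplified illustration, not the paper's module: it has no learned projections, the `mask_token` query is a hypothetical stand-in for a learned mask embedding, and mean squared error is an assumed choice of reconstruction loss.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruct_masked_patch(patches, masked_idx, query):
    """Rebuild patches[masked_idx] as an attention-weighted mean of the other patches.

    The masked patch is excluded from the keys and values, so the
    reconstruction can only use information from its neighbors.
    """
    dim = len(query)
    neighbors = [p for j, p in enumerate(patches) if j != masked_idx]
    # Scaled dot-product attention scores between the query and each neighbor.
    scores = [sum(q * k for q, k in zip(query, p)) / math.sqrt(dim) for p in neighbors]
    weights = softmax(scores)
    return [sum(w * p[d] for w, p in zip(weights, neighbors)) for d in range(dim)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy patch embeddings for one proposal (hypothetical values).
patches = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
mask_token = [1.0, 0.0]  # hypothetical query standing in for a learned mask embedding

recon = reconstruct_masked_patch(patches, masked_idx=0, query=mask_token)
recon_loss = mse(recon, patches[0])
```

In a trained model the attention weights would come from learned query, key, and value projections, and minimizing the reconstruction loss would push each patch embedding to be predictable from the other patches of the same proposal.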
