Self-Supervised Pretraining of 3D Features on any Point-Cloud
Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra
TL;DR
DepthContrast presents a simple, registration-free self-supervised pretraining framework for 3D features that operates on unprocessed single- or multi-view depth maps and supports multiple input formats (points and voxels). It extends instance discrimination to cross-format representations via a momentum encoder and a joint loss, enabling effective pretraining across diverse 3D architectures. The method yields state-of-the-art results on ScanNet and SUNRGBD and demonstrates strong label efficiency, as well as robust generalization to outdoor LiDAR data. Key findings show that jointly training across formats and well-designed 3D data augmentations are crucial for unlocking large-capacity 3D models with limited supervision.
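The cross-format objective summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes in-batch negatives instead of a memory bank or queue, and the function names (`info_nce`, `joint_cross_format_loss`, `momentum_update`), the temperature of 0.1, and the momentum of 0.999 are hypothetical choices made for illustration only.

```python
import numpy as np

def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss: queries[i] and keys[i] are the positive pair;
    all other rows of keys act as negatives (in-batch negatives, an assumption)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                        # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def joint_cross_format_loss(point_q, point_k, voxel_q, voxel_k, temperature=0.1):
    """Sum of within-format and cross-format instance-discrimination terms.
    *_q come from the trained encoders, *_k from their momentum copies."""
    within = (info_nce(point_q, point_k, temperature)
              + info_nce(voxel_q, voxel_k, temperature))
    across = (info_nce(point_q, voxel_k, temperature)
              + info_nce(voxel_q, point_k, temperature))
    return within + across

def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average of the query encoder's parameters."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

In this sketch each of the four embedding matrices has one row per depth map in the batch, so row i of the point features and row i of the voxel features describe the same scene under different augmentations and input formats.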
Abstract
Pretraining on large labeled datasets is a prerequisite for good performance in many computer vision tasks, such as 2D object recognition and video classification. However, pretraining is not widely used for 3D recognition tasks, where state-of-the-art methods train models from scratch. A primary reason is the lack of large annotated datasets, because 3D data is both difficult to acquire and time-consuming to label. We present a simple self-supervised pretraining method that can work with any 3D data (single- or multi-view, indoor or outdoor, acquired by varied sensors) without 3D registration. We pretrain standard point-cloud and voxel-based model architectures, and show that joint pretraining further improves performance. We evaluate our models on 9 benchmarks for object detection, semantic segmentation, and object classification, where they achieve state-of-the-art results and can outperform supervised pretraining. We set a new state of the art for object detection on ScanNet (69.0% mAP) and SUNRGBD (63.5% mAP). Our pretrained models are label efficient and improve performance for classes with few examples.
