Self-Supervised Pretraining of 3D Features on any Point-Cloud

Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

TL;DR

DepthContrast presents a simple, registration-free self-supervised pretraining framework for 3D features that operates on unprocessed single- or multi-view depth maps and supports multiple input formats (points and voxels). It extends instance discrimination to cross-format representations via a momentum encoder and a joint loss, enabling effective pretraining across diverse 3D architectures. The method yields state-of-the-art results on ScanNet and SUNRGBD and demonstrates strong label efficiency, as well as robust generalization to outdoor LiDAR data. Key findings show that jointly training across formats and well-designed 3D data augmentations are crucial for unlocking large-capacity 3D models with limited supervision.
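
The momentum encoder mentioned above can be illustrated with a short sketch. It assumes the MoCo-style exponential-moving-average update, which the summary does not spell out. The function name `momentum_update`, the dict-of-lists parameter layout, and the coefficient `m` are hypothetical choices for illustration, not details taken from the paper.

```python
def momentum_update(query_params, key_params, m=0.999):
    """Return key-encoder parameters after one exponential-moving-average step.

    query_params, key_params: dicts mapping a parameter name to a list of floats.
    m: momentum coefficient (hypothetical default, not taken from the paper).
    """
    updated = {}
    for name, key_values in key_params.items():
        query_values = query_params[name]
        # The key (momentum) encoder drifts slowly toward the query encoder.
        updated[name] = [
            m * k + (1.0 - m) * q for k, q in zip(key_values, query_values)
        ]
    return updated


# Toy usage: one update step with a small momentum so the change is visible.
query = {"w": [1.0, 2.0]}
key = {"w": [0.0, 0.0]}
key = momentum_update(query, key, m=0.9)
```

With `m` close to 1 the key encoder changes slowly, which keeps the features used as comparison targets stable from one training step to the next.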

Abstract

Pretraining on large labeled datasets is a prerequisite for good performance in many computer vision tasks, such as 2D object recognition and video classification. However, pretraining is not widely used for 3D recognition tasks, where state-of-the-art methods train models from scratch. A primary reason is the lack of large annotated datasets, because 3D data is both difficult to acquire and time-consuming to label. We present a simple self-supervised pretraining method that can work with any 3D data (single- or multi-view, indoor or outdoor, acquired by varied sensors) without 3D registration. We pretrain standard point cloud and voxel-based model architectures, and show that joint pretraining further improves performance. We evaluate our models on 9 benchmarks for object detection, semantic segmentation, and object classification, where they achieve state-of-the-art results and can outperform supervised pretraining. We set a new state-of-the-art for object detection on ScanNet (69.0% mAP) and SUNRGBD (63.5% mAP). Our pretrained models are label efficient and improve performance for classes with few examples.

Paper Structure

This paper contains 33 sections, 3 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Label-efficiency of our self-supervised pretraining. We finetune detection models from scratch or using our pretraining as initialization. Our pretraining, which uses unlabeled single-view 3D data, outperforms training from scratch and achieves the same detection performance with about half the detection labels.
  • Figure 2: Approach Overview. We propose DepthContrast, a simple 3D representation learning method that uses large amounts of unprocessed single/multi-view depth maps. Given a depth map, we construct two augmented versions using data augmentation and represent them with different input formats (point coordinates and voxels). We use format-specific encoders to get spatial features, which are pooled and projected to obtain global features $\mathbf{v}$. The global features are used to set up an instance discrimination task and pretrain the encoders.
  • Figure 3: Scaling the model size and pretraining data. We increase the model capacity of the PointNet++ model by increasing the width by $\{2\times,3\times,4\times\}$. When training from scratch, increasing the model capacity increases the performance but ultimately leads to overfitting. Overfitting is more pronounced on small datasets like S3DIS. Our DepthContrast pretraining on ScanNet-vid improves the performance for larger models and reduces overfitting. We increase the pretraining data by combining the readily available single-view depth maps from ScanNet-vid and Redwood-vid. DepthContrast's performance improves significantly when using both large data and large models.
  • Figure 4: Pretraining benefits long-tail classes. We analyze the gain of our pretraining across different classes for SUNRGBD object detection. The training data has a long-tailed distribution where the least frequent classes occur $50\times$ less often than the most frequent classes. Our pretraining improves performance for classes with fewer labeled instances by $4$-$8\%$. (Trend line in green.)
  • Figure 5: Using outdoor LiDAR data. We finetune detection models from scratch or using our pretraining and report mAP (with 40 recall positions) on the cyclist class at the moderate difficulty level of the KITTI val split. Our models are pretrained using unlabeled outdoor data from the Waymo dataset and outperform scratch training using either point (left) or voxel (right) inputs.
  • ...and 7 more figures
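
The instance discrimination task described in the Figure 2 caption can be sketched with an InfoNCE-style contrastive loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, and the toy feature vectors are hypothetical, and InfoNCE is assumed as the concrete form of the loss. Here the query stands for the global feature of one augmented view in point format, the positive for the feature of the other augmented view of the same depth map in voxel format, and the negatives for features of other depth maps.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query against one positive and a list of negatives.

    All inputs are plain lists of floats and are assumed to be non-zero.
    temperature is a hypothetical value chosen for illustration.
    """
    q = l2_normalize(query)
    candidates = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # Cosine similarity between the query and each candidate, scaled by temperature.
    logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    # Negative log-probability of picking the positive (index 0).
    return log_denominator - logits[0]


# Toy features: the voxel-format positive is close to the point-format query.
point_feature = [1.0, 0.1, 0.0]
voxel_feature_same_scene = [0.9, 0.2, 0.0]
other_scene_features = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_matched = info_nce(point_feature, voxel_feature_same_scene, other_scene_features)
loss_mismatched = info_nce(point_feature, [0.0, 1.0, 0.0], other_scene_features)
```

The loss is small when the two formats of the same depth map map to nearby global features and grows when the positive is no closer to the query than the negatives are, so `loss_matched` comes out lower than `loss_mismatched`.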