PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding

Saining Xie, Jiatao Gu, Demi Guo, Charles R. Qi, Leonidas J. Guibas, Or Litany

TL;DR

The paper tackles transferable 3D scene understanding with limited labeled data by introducing PointContrast, an unsupervised pre-training framework for 3D point clouds. It pre-trains a single SR-UNet backbone on a large set of ScanNet point cloud pairs using dense point-level contrastive losses (PointInfoNCE and Hardest-Contrastive), then fine-tunes it on downstream segmentation and detection tasks across indoor and outdoor, real and synthetic datasets. Fine-tuning improves results on six benchmarks, and the gains are similar to those from supervised pre-training, particularly when labeled data is scarce. The authors take this to suggest that scaling up unlabeled 3D data collection may matter more than producing more detailed annotations, and that large-scale unsupervised pre-training is a promising direction for 3D representation learning and high-level scene understanding.
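To make the idea of a dense point-level contrastive loss more concrete, here is a minimal sketch of an InfoNCE-style loss over matched point features. It is an illustration only, not the authors' implementation: the function name `point_info_nce`, the L2 normalization, the averaging over points, and the default temperature are all assumptions, and the real method runs on features produced by the SR-UNet backbone rather than on hand-written lists.

```python
import math


def point_info_nce(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style loss over matched point features (illustrative sketch).

    feats_a[i] and feats_b[i] are assumed to be features of the same 3D point
    seen in two views (a positive pair). Every feats_b[k] with k != i serves
    as a negative for point i. The temperature default is an arbitrary choice.
    """
    def normalize(vec):
        # L2-normalize so the dot product is a cosine similarity (assumption).
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]

    total = 0.0
    for i, fa in enumerate(a):
        # Similarity of point i in view A to every matched point in view B.
        logits = [sum(x * y for x, y in zip(fa, fb)) / temperature for fb in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true match.
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical toy features: two points with 2-D features per view.
view_a = [[1.0, 0.0], [0.0, 1.0]]
matched = point_info_nce(view_a, view_a)
mismatched = point_info_nce(view_a, [view_a[1], view_a[0]])
print(matched < mismatched)
```

The loss is small when each point's feature is closest to its own match in the other view and large when it is closer to some other point, which is the behavior a contrastive pretext task relies on.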

Abstract

Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning.

Paper Structure

This paper contains 46 sections, 2 equations, 4 figures, 15 tables, 2 algorithms.

Figures (4)

  • Figure 1: Training from scratch vs. fine-tuning with ShapeNet pre-trained weights.
  • Figure 2: PointContrast: Pretext task for 3D pre-training.
  • Figure 3: SR-UNet architecture we used as a shared backbone network for pre-training and fine-tuning tasks. For segmentation and detection tasks, both the encoder and decoder weights are fine-tuned; for classification downstream tasks, only the encoder network is kept and fine-tuned.
  • Figure 4: Visualization of the ScanNet point cloud pair dataset used for pre-training. Each row is a randomly sampled scene. Each column is a different pair of point clouds sampled from the same scene. The two colors correspond to the two views (partial scans). At least 30% of the points overlap between the two views.