SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images

Jiahua Dong, Tong Wu, Rui Qian, Jiaqi Wang

TL;DR

SimC3D tackles the data bottleneck in 3D pretraining by enabling 3D backbone pretraining from pure RGB images. It synthesizes monocular point clouds from RGB images using depth estimation and trains with a lightweight, locality-focused 2D position learning target, using the InfoNCE contrastive loss $L_{InfoNCE}$. The approach achieves strong downstream performance across 3D segmentation and detection tasks, often matching or exceeding state-of-the-art methods that rely on real point clouds, and scales well with additional image data. This RGB-only pretraining paradigm reduces data collection costs and broadens the applicability of large-scale 3D foundation models for indoor scenes and beyond.
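As a reference point for the loss named above, here is a minimal NumPy sketch of a standard InfoNCE objective over matched feature pairs. The function name `info_nce`, the temperature value of 0.07, and the choice to treat every other row in the batch as a negative are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def info_nce(online, target, temperature=0.07):
    """Standard InfoNCE over N matched (online, target) pairs of shape (N, D).

    Hypothetical helper: the name, the temperature default, and the
    in-batch negatives are illustrative assumptions.
    """
    # L2-normalise so that dot products are cosine similarities.
    online = online / np.linalg.norm(online, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = online @ target.T / temperature  # (N, N) similarity matrix
    # Numerically stable log-softmax along each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i is column i; all other columns act as negatives.
    return -np.mean(np.diag(log_prob))
```

In this form the loss is small when each online feature is most similar to its own target and grows when the pairing is broken, which is the behaviour a per-point contrastive target relies on.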

Abstract

The 3D contrastive learning paradigm has demonstrated remarkable performance in downstream tasks through pretraining on point cloud data. Recent advances involve additional 2D image priors associated with 3D point clouds for further improvement. Nonetheless, these existing frameworks are constrained by the restricted range of available point cloud datasets, primarily due to the high costs of obtaining point cloud data. To this end, we propose SimC3D, a simple but effective 3D contrastive learning framework that, for the first time, pretrains 3D backbones from pure RGB image data. SimC3D performs contrastive 3D pretraining with three appealing properties. (1) Pure image data: SimC3D removes the dependency on costly 3D point clouds and pretrains 3D backbones using solely RGB images. By employing depth estimation and suitable data processing, the monocular synthesized point cloud shows great potential for 3D pretraining. (2) Simple framework: Traditional multi-modal frameworks facilitate 3D pretraining with 2D priors by utilizing an additional 2D backbone, thereby increasing computational expense. In this paper, we empirically demonstrate that the primary benefit of the 2D modality stems from the incorporation of locality information. Inspired by this observation, SimC3D directly employs 2D positional embeddings as a stronger contrastive objective, eliminating the necessity for 2D backbones and leading to considerable performance improvements. (3) Strong performance: SimC3D outperforms previous approaches that leverage ground-truth point cloud data for pretraining in various downstream tasks. Furthermore, the performance of SimC3D can be further enhanced by combining multiple image datasets, showcasing its significant potential for scalability. The code will be available at https://github.com/Dongjiahua/SimC3D.
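To make the "monocular synthesized point cloud" idea concrete, the following is a minimal sketch of lifting a depth map to 3D points with a pinhole camera model. The function name `backproject` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical; the abstract does not specify the camera parameters or the data processing applied afterwards, so this only illustrates the generic geometry.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud.

    Assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy).
    Also returns the (u, v) pixel coordinate of every point, so that each
    3D point keeps a link to its 2D image position.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    pixels = np.stack([u, v], axis=-1).reshape(-1, 2)
    return points, pixels
```

Keeping the pixel coordinates alongside the points is a design choice of this sketch: it is what allows a per-point 2D target to be looked up later without any 2D backbone.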

Paper Structure

This paper contains 32 sections, 7 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: SimC3D performs contrastive 3D pretraining with three appealing properties. (1) Pure image data: It simplifies the requirements of pre-training datasets, transitioning from expensive 3D point clouds to pure RGB images. (2) Simple framework: In contrast to previous multi-modal frameworks that rely on an additional 2D backbone, e.g., ResNet, to encode image features, SimC3D directly employs 2D positional embeddings as the training objective, thereby eliminating the need for a 2D encoder within the framework. (3) Strong performance: Despite simplifying the data requirements and training framework, SimC3D still achieves strong performance across various downstream tasks, as shown in the radar plot where 'SOTA' represents the highest score achieved in prior works.
  • Figure 1: Analysis of different backbones: There is no significant difference between different backbones, even for a randomly initialized and frozen ResNet.
  • Figure 2: PCA analysis on points' 2D target feature.
  • Figure 3: SimC3D Overview: Given an RGB image, we first use MiDaS to extract the inverse depth map, and then project it to a point cloud with fixed camera calibration parameters. Then, we apply suitable point cloud processing and extract the online branch feature. For the target branch, we directly apply the 2D positional encoding in the 2D branch. Each point's target is directly sampled from the feature map by its 2D coordinates. Finally, a contrastive InfoNCE loss and an RGB reconstruction loss are used for pre-training.
  • Figure I: Data efficiency comparison. We conduct comparisons on limited scenes and limited annotations, using the SUN RGB-D detection benchmark and PointNet++ as the backbone. Our SimC3D method shows consistently better results.
  • ...and 2 more figures
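
The target branch described in the Figure 3 caption (a 2D positional encoding sampled at each point's 2D coordinates) can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the fixed sine-cosine encoding, the nearest-pixel lookup, and the names `sincos_2d` and `sample_targets` are all hypothetical, since the caption does not specify the encoding or the sampling scheme.

```python
import numpy as np

def sincos_2d(height, width, dim):
    """Fixed 2D sine-cosine positional map of shape (height, width, dim).

    Hypothetical stand-in for the 2D positional encoding; half of the
    channels encode the row index and half encode the column index.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    freqs = 1.0 / (10000.0 ** (np.arange(quarter) / quarter))  # (dim/4,)
    ys = np.arange(height)[:, None] * freqs                     # (H, dim/4)
    xs = np.arange(width)[:, None] * freqs                      # (W, dim/4)
    emb_y = np.concatenate([np.sin(ys), np.cos(ys)], axis=1)    # (H, dim/2)
    emb_x = np.concatenate([np.sin(xs), np.cos(xs)], axis=1)    # (W, dim/2)
    return np.concatenate([
        np.broadcast_to(emb_y[:, None, :], (height, width, dim // 2)),
        np.broadcast_to(emb_x[None, :, :], (height, width, dim // 2)),
    ], axis=-1)

def sample_targets(pos_map, pixels):
    """Nearest-pixel lookup of per-point targets.

    pixels is an (N, 2) integer array of (u, v) = (column, row) coordinates;
    the result has shape (N, dim). Interpolated sampling would be an
    equally plausible choice.
    """
    u, v = pixels[:, 0], pixels[:, 1]
    return pos_map[v, u]
```

Because the map depends only on pixel position, nearby points in the image receive similar targets, which matches the locality argument made in the abstract, and no 2D encoder has to be run to produce them.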