SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images

Jiahua Dong; Tong Wu; Rui Qian; Jiaqi Wang

TL;DR

SimC3D tackles the data bottleneck in 3D pretraining by pretraining 3D backbones from pure RGB images. It synthesizes monocular point clouds from RGB images via depth estimation and trains against a lightweight, locality-focused 2D position learning target with the InfoNCE contrastive loss $L_{InfoNCE}$. The approach achieves strong downstream performance on 3D segmentation and detection tasks, often matching or exceeding state-of-the-art methods that rely on real point clouds, and it scales well with additional image data. This RGB-only pretraining paradigm reduces data collection costs and broadens the applicability of large-scale 3D foundation models for indoor scenes and beyond.
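The InfoNCE loss named above can be illustrated with a minimal NumPy sketch. It assumes a batch in which `queries[i]` and `keys[i]` form the positive pair and every other key in the batch acts as a negative; the function name `info_nce`, the cosine-similarity scoring, and the temperature value 0.07 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss for (N, D) arrays where queries[i] pairs with keys[i].

    Hypothetical helper: the temperature and cosine similarity are assumptions.
    """
    # L2-normalize so the dot product is a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = q @ k.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood of the positives.
    return float(-np.mean(np.diag(log_prob)))
```

In a setup like the one described, the queries would be per-point features from the 3D backbone and the keys would be the 2D position targets for the same points; that pairing is an interpretation of the summary rather than a confirmed implementation detail.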

Abstract

The 3D contrastive learning paradigm has demonstrated remarkable performance in downstream tasks through pretraining on point cloud data. Recent advances incorporate additional 2D image priors associated with 3D point clouds for further improvement. Nonetheless, these existing frameworks are constrained by the limited range of available point cloud datasets, primarily due to the high cost of obtaining point cloud data. To this end, we propose SimC3D, a simple but effective 3D contrastive learning framework that, for the first time, pretrains 3D backbones from pure RGB image data. SimC3D performs contrastive 3D pretraining with three appealing properties. (1) Pure image data: SimC3D removes the dependency on costly 3D point clouds and pretrains 3D backbones using solely RGB images. By employing depth estimation and suitable data processing, the monocular synthesized point cloud shows great potential for 3D pretraining. (2) Simple framework: Traditional multi-modal frameworks facilitate 3D pretraining with 2D priors by utilizing an additional 2D backbone, thereby increasing computational expense. In this paper, we empirically demonstrate that the primary benefit of the 2D modality stems from the incorporation of locality information. Inspired by this observation, SimC3D directly employs 2D positional embeddings as a stronger contrastive objective, eliminating the need for 2D backbones and leading to considerable performance improvements. (3) Strong performance: SimC3D outperforms previous approaches that leverage ground-truth point cloud data for pretraining in various downstream tasks. Furthermore, the performance of SimC3D can be further enhanced by combining multiple image datasets, showcasing its significant potential for scalability. The code will be available at https://github.com/Dongjiahua/SimC3D.
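As a rough illustration of properties (1) and (2), the sketch below back-projects a depth map into a monocular point cloud and builds a fixed 2D positional embedding for each point's source pixel. It assumes a pinhole camera with known intrinsics `fx`, `fy`, `cx`, `cy` and a sinusoidal embedding; the abstract does not specify the depth estimator, the data processing, or the embedding form, so the function names and all of these choices are hypothetical.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into camera-frame 3D points.

    Assumes a pinhole camera model (an assumption, not from the abstract).
    Returns (N, 3) points and the (N, 2) pixel coordinates (u, v) they came
    from, keeping only pixels with positive depth.
    """
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    pixels = np.stack([u, v], axis=-1).reshape(-1, 2)
    valid = points[:, 2] > 0
    return points[valid], pixels[valid]


def sincos_2d_embedding(pixels, dim=32):
    """Fixed sinusoidal embedding of (N, 2) pixel positions, shape (N, dim).

    The sinusoidal form is an illustrative stand-in for the paper's
    2D positional embedding, whose exact definition is not given here.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    n_freq = dim // 4
    freqs = 1.0 / (10000.0 ** (np.arange(n_freq) / n_freq))
    angle_u = pixels[:, :1] * freqs  # (N, n_freq)
    angle_v = pixels[:, 1:] * freqs  # (N, n_freq)
    return np.concatenate(
        [np.sin(angle_u), np.cos(angle_u), np.sin(angle_v), np.cos(angle_v)],
        axis=1,
    )
```

Because the embedding depends only on pixel position, targets of this kind can be computed without running any 2D backbone, which is the cost saving the abstract describes. A predicted depth map from any monocular estimator would take the place of `depth` in practice.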
