Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting
Hao Liu, Minglin Chen, Yanni Ma, Haihong Xiao, Ying He
TL;DR
This work addresses the data annotation bottleneck in 3D vision by introducing GS$^3$, a self-supervised pre-training framework that uses 3D Gaussian Splatting to render RGB images from learned point-cloud features. By back-projecting sparse RGB-D views into 3D and predicting scene-aligned Gaussian primitives, GS$^3$ enables fast, memory-efficient neural rendering with a tile-based rasterizer, and optimizes with a color and perceptual loss $L = L_{color} + \lambda L_{lpips}$ where $\lambda=0.05$. The approach yields strong transfer to downstream 3D tasks (detection, segmentation, instance segmentation, and reconstruction) and achieves substantial efficiency gains (roughly 9× speedup and <0.25× memory) compared to prior rendering-based SSL methods like Ponder. These results demonstrate the practicality of generalizable Gaussian-based neural rendering for scalable 3D representation learning and highlight GS$^3$ as a versatile pre-training strategy for diverse 3D perception tasks.
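As a rough illustration of the objective $L = L_{color} + \lambda L_{lpips}$ with $\lambda = 0.05$, the sketch below combines a color term with a weighted perceptual term. It rests on assumptions not stated in the text: $L_{color}$ is taken to be a mean absolute error, images are flattened lists of values in [0, 1], and `perceptual_fn` is a hypothetical stand-in for an LPIPS implementation, which in practice needs a pretrained network.

```python
from typing import Callable, Sequence

# Hypothetical representation: an image is a flat sequence of values in [0, 1].
Image = Sequence[float]


def color_loss(rendered: Image, target: Image) -> float:
    """Mean absolute error between two images.

    Assumption: the text does not say which color loss is used; L1 is a
    common choice and is used here for illustration only.
    """
    if len(rendered) != len(target):
        raise ValueError("images must have the same number of values")
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(rendered)


def total_loss(
    rendered: Image,
    target: Image,
    perceptual_fn: Callable[[Image, Image], float],
    lam: float = 0.05,
) -> float:
    """L = L_color + lam * L_lpips, with lam = 0.05 as stated in the text.

    `perceptual_fn` is a placeholder for an LPIPS metric supplied by the caller.
    """
    return color_loss(rendered, target) + lam * perceptual_fn(rendered, target)
```

Passing the perceptual metric in as a callable keeps the sketch independent of any particular LPIPS package.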
Abstract
Pre-training on large-scale unlabeled datasets helps models achieve strong performance on 3D vision tasks, especially when annotations are limited. However, existing rendering-based self-supervised frameworks are computationally demanding and memory-intensive during pre-training because they rely on volume rendering. In this paper, we propose an efficient framework named GS$^3$ to learn point cloud representations, which integrates fast 3D Gaussian Splatting into the rendering-based framework. The core idea behind our framework is to pre-train the point cloud encoder by comparing rendered RGB images with real RGB images, since only Gaussian points that carry rich learned geometric and appearance information can produce high-quality renderings. Specifically, we back-project the input RGB-D images into 3D space and use a point cloud encoder to extract point-wise features. Then, we predict 3D Gaussian points of the scene from the learned point cloud features and use a tile-based rasterizer for image rendering. Finally, the pre-trained point cloud encoder can be fine-tuned to adapt to various downstream 3D tasks, including high-level perception tasks such as 3D segmentation and detection, as well as low-level tasks such as 3D scene reconstruction. Extensive experiments on downstream tasks demonstrate the strong transferability of the pre-trained point cloud encoder and the effectiveness of our self-supervised learning framework. In addition, our GS$^3$ framework is highly efficient, achieving approximately 9$\times$ pre-training speedup and less than 0.25$\times$ memory cost compared to the previous rendering-based framework Ponder.
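The back-projection step can be sketched with a standard pinhole camera model. This is a minimal illustration under assumptions, not the authors' implementation: the intrinsics `fx`, `fy`, `cx`, `cy` and the function name are hypothetical, the depth map is a nested list, and points are returned in the camera frame (aligning them to the scene would additionally require the camera pose, which is omitted here).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def back_project(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift a depth map to 3D camera-frame points with a pinhole model.

    For pixel (u, v) with depth z:
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):      # v indexes image rows
        for u, z in enumerate(row):      # u indexes image columns
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

A real pipeline would typically vectorize this with an array library and attach the RGB value of each pixel to its 3D point before passing the cloud to the encoder.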
