Point Cloud Unsupervised Pre-training via 3D Gaussian Splatting

Hao Liu, Minglin Chen, Yanni Ma, Haihong Xiao, Ying He

TL;DR

This work addresses the data annotation bottleneck in 3D vision by introducing GS$^3$, a self-supervised pre-training framework that uses 3D Gaussian Splatting to render RGB images from learned point-cloud features. By back-projecting sparse RGB-D views into 3D and predicting scene-aligned Gaussian primitives, GS$^3$ enables fast, memory-efficient neural rendering with a tile-based rasterizer, and optimizes with a color and perceptual loss $L = L_{color} + \lambda L_{lpips}$ where $\lambda=0.05$. The approach yields strong transfer to downstream 3D tasks (detection, segmentation, instance segmentation, and reconstruction) and achieves substantial efficiency gains (roughly 9× speedup and <0.25× memory) compared to prior rendering-based SSL methods like Ponder. These results demonstrate the practicality of generalizable Gaussian-based neural rendering for scalable 3D representation learning and highlight GS$^3$ as a versatile pre-training strategy for diverse 3D perception tasks.
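The objective $L = L_{color} + \lambda L_{lpips}$ with $\lambda=0.05$ can be sketched as follows. This is a minimal illustration and not the authors' implementation: the summary does not say whether $L_{color}$ is an L1 or L2 distance, so mean absolute error is assumed here, and the perceptual term is passed in as a callable because LPIPS needs a pretrained network. The helper `toy_perceptual` is a hypothetical stand-in, not LPIPS.

```python
import numpy as np

def color_loss(rendered, target):
    """Mean absolute error between two images (assumed form of L_color)."""
    return float(np.mean(np.abs(rendered - target)))

def total_loss(rendered, target, perceptual_fn, lam=0.05):
    """L = L_color + lam * L_lpips; the perceptual term is supplied by the caller."""
    return color_loss(rendered, target) + lam * float(perceptual_fn(rendered, target))

def toy_perceptual(a, b):
    # Hypothetical stand-in for LPIPS: squared distance between horizontal gradients.
    return np.mean((np.diff(a, axis=1) - np.diff(b, axis=1)) ** 2)

rng = np.random.default_rng(0)
target = rng.random((240, 320, 3))  # stand-in for a real 320 x 240 RGB image
rendered = np.clip(target + 0.1 * rng.standard_normal(target.shape), 0.0, 1.0)
print(total_loss(rendered, target, toy_perceptual))
```

With the small weight of 0.05, the color term dominates the sum unless the perceptual distance is much larger than the per-pixel error.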

Abstract

Pre-training on large-scale unlabeled datasets helps models achieve strong performance on 3D vision tasks, especially when annotations are limited. However, existing rendering-based self-supervised frameworks are computationally demanding and memory-intensive during pre-training due to the inherent nature of volume rendering. In this paper, we propose an efficient framework named GS$^3$ to learn point cloud representations, which seamlessly integrates fast 3D Gaussian Splatting into the rendering-based framework. The core idea behind our framework is to pre-train the point cloud encoder by comparing rendered RGB images with real RGB images, as only Gaussian points that carry rich learned geometric and appearance information can produce high-quality renderings. Specifically, we back-project the input RGB-D images into 3D space and use a point cloud encoder to extract point-wise features. Then, we predict 3D Gaussian points of the scene from the learned point cloud features and use a tile-based rasterizer for image rendering. Finally, the pre-trained point cloud encoder can be fine-tuned to adapt to various downstream 3D tasks, including high-level perception tasks such as 3D segmentation and detection, as well as low-level tasks such as 3D scene reconstruction. Extensive experiments on downstream tasks demonstrate the strong transferability of the pre-trained point cloud encoder and the effectiveness of our self-supervised learning framework. In addition, our GS$^3$ framework is highly efficient, achieving approximately 9$\times$ pre-training speedup and less than 0.25$\times$ memory cost compared to the previous rendering-based framework Ponder.
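The back-projection step described in the abstract can be illustrated with a short sketch. It assumes a standard pinhole camera model with depth measured along the camera z-axis and a 4x4 camera-to-world pose; the function name `back_project` and all parameter names are hypothetical, and the paper's own preprocessing may differ.

```python
import numpy as np

def back_project(depth, rgb, K, cam_to_world):
    """Lift one RGB-D image to a colored point cloud in world coordinates.

    depth:        (H, W) depth along the camera z-axis, 0 where invalid
    rgb:          (H, W, 3) per-pixel colors
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera pose
    Returns (N, 3) points and (N, 3) colors for the N valid pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    # Rotate and translate camera-frame points into the world frame.
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts_world, rgb[valid]

# Toy input: a flat wall 2 m away, with one invalid pixel.
depth = np.full((240, 320), 2.0)
depth[0, 0] = 0.0
rgb = np.zeros((240, 320, 3))
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
points, colors = back_project(depth, rgb, K, np.eye(4))
print(points.shape, colors.shape)
```

Point clouds from several sparse views can be concatenated after this step, since every view is expressed in the same world frame.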

Paper Structure

This paper contains 21 sections, 8 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Comparison of 3D detection performance (mAP@0.5), 3D segmentation accuracy (mIoU), pre-training time, and memory consumption between Ponder and our GS$^3$. The pre-training time and memory usage of our method are measured at a rendered image resolution of 320 $\times$ 240. Due to limited computational resources, the pre-training time of Ponder with 76,800 sampling rays is estimated from its result with 4,800 rays. Memory consumption for pre-training is reported only for Ponder with 4,800 rays.
  • Figure 2: The overall framework of the proposed GS$^3$. Given sparse-view RGB-D images, we back-project them into 3D space to generate colored point clouds. A point cloud encoder is then used to extract point-wise features, which are used to predict scene Gaussians in a point-aligned manner. These Gaussians are rendered into RGB images through a differentiable tile-based rasterizer. The point cloud encoder is pre-trained by comparing the rendered images with the real images.
  • Figure 3: The network architecture of our feature encoder. (a) SR-UNet and (b) PointNet++. For SR-UNet, each sparse (de)convolution layer is followed by a batch norm (BN) layer and a ReLU activation layer. D is the output dimension and N is the number of repeated layers. For PointNet++, SA represents the set abstraction layer, while FP denotes the feature propagation layer. np and r represent the number of down-sampled points and the radius for each SA layer.
  • Figure 4: Qualitative results of our fine-tuned model on downstream (a) 3D semantic segmentation and (b) 3D instance segmentation tasks.
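
The point-aligned Gaussian prediction named in the Figure 2 caption can be sketched as a small head on top of per-point features. This is an illustration under stated assumptions, not the paper's architecture: it uses the usual 3D Gaussian Splatting parameters (center, scale, rotation quaternion, opacity, color), a single linear layer, and a hypothetical bounded offset `max_offset` that keeps each Gaussian near its source point. Rasterization is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_gaussians(points, feats, W, b, max_offset=0.05):
    """Map per-point features to one Gaussian per input point.

    points: (N, 3) point positions, feats: (N, C) encoder features
    W: (C, 14), b: (14,) weights of a hypothetical linear head
    """
    raw = feats @ W + b
    # Channel layout (assumed): 3 offset, 3 log-scale, 4 quaternion, 1 opacity, 3 color.
    offset, log_scale, quat, opacity, color = np.split(raw, [3, 6, 10, 11], axis=1)
    return {
        "mean": points + max_offset * np.tanh(offset),  # stays close to the point
        "scale": np.exp(log_scale),                     # strictly positive
        "rotation": quat / (np.linalg.norm(quat, axis=1, keepdims=True) + 1e-8),
        "opacity": sigmoid(opacity),                    # in (0, 1)
        "color": sigmoid(color),                        # RGB in (0, 1)
    }

rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
feats = rng.standard_normal((1000, 32))
W = 0.01 * rng.standard_normal((32, 14))
b = np.zeros(14)
b[6] = 1.0  # bias the quaternion toward the identity rotation
gaussians = predict_gaussians(pts, feats, W, b)
print({name: value.shape for name, value in gaussians.items()})
```

In a full pipeline these parameters would be passed to a differentiable tile-based rasterizer, and the image loss would be back-propagated through the head into the encoder.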