
GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving

Shaoqing Xu, Fang Li, Shengyin Jiang, Ziying Song, Li Liu, Zhi-xin Yang

TL;DR

GaussianPretrain is a unified pre-training approach for autonomous driving, built on 3D Gaussian Splatting, that jointly encodes geometry and texture using learnable Gaussian anchors tied to a 3D voxel grid. It uses LiDAR-depth-guided MAE patch selection and ray-based Gaussian anchors to reconstruct RGB, depth, and occupancy from masked patches, training faster and with less GPU memory than NeRF-based methods. On nuScenes, the method improves 3D object detection, HD map construction, and occupancy prediction over both ImageNet pre-training and prior NeRF-based pre-training. Alongside its efficiency and robustness, the work notes limitations in temporal and multi-modal integration and points to temporal modeling and cross-modality fusion as future extensions.

Abstract

Self-supervised learning has made substantial strides in image processing, while visual pre-training for autonomous driving is still in its infancy. Existing methods often focus on learning geometric scene information while neglecting texture or treating both aspects separately, hindering comprehensive scene understanding. In this context, we are excited to introduce GaussianPretrain, a novel pre-training paradigm that achieves a holistic understanding of the scene by uniformly integrating geometric and texture representations. Conceptualizing 3D Gaussian anchors as volumetric LiDAR points, our method learns a deeper understanding of scenes, enhancing pre-training performance with detailed spatial structure and texture while running 40.6% faster than the NeRF-based method UniPAD and using only 70% of its GPU memory. We demonstrate the effectiveness of GaussianPretrain across multiple 3D perception tasks, showing significant performance improvements, including a 7.05% increase in NDS for 3D object detection, a 1.9% mAP gain in HD map construction, and a 0.8% improvement in occupancy prediction. These significant gains highlight GaussianPretrain's theoretical innovation and strong practical potential, promoting visual pre-training development for autonomous driving. Source code will be available at https://github.com/Public-BOTs/GaussianPretrain

Paper Structure

This paper contains 30 sections, 10 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Illustration of our proposed GaussianPretrain, a simple, innovative, and efficient framework for vision pre-training with a 3D Gaussian Splatting (3D-GS) representation. Thanks to this pre-training paradigm, downstream 3D perception tasks improve substantially, including 3D object detection, HD map construction, and occupancy prediction.
  • Figure 2: Comparison of UVTR [li2022unifying] performance on the nuScenes dataset under different pre-training frameworks: ImageNet, UniPAD [yang2024unipad], and our GaussianPretrain.
  • Figure 3: The architecture of the proposed GaussianPretrain. Given multi-view images, we first extract valid mask patches using the mask generator with the LiDAR Depth Guidance strategy. Subsequently, a set of learnable 3D Gaussian anchors is generated using ray-based guidance and conceptualized as volumetric LiDAR points. Finally, the reconstruction signals of RGB, depth, and occupancy are decoded from the predicted Gaussian anchor parameters.
  • Figure 4: Process of generating valid mask patches.
  • Figure 5: Effect of GaussianPretrain on fine-tuning when annotations are reduced from the full training set to a 1/4 subset.
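
The captions for Figures 3 and 4 mention a mask generator that picks "valid mask patches" under LiDAR depth guidance, but the text here does not spell out the rule. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes a patch is "valid" when at least one LiDAR point projects into it (non-zero depth), and that a fixed fraction of the valid patches is then masked at random. The function name `select_mask_patches` and the parameters `patch`, `mask_ratio`, and `seed` are hypothetical.

```python
import random


def select_mask_patches(depth, patch, mask_ratio, seed=0):
    """Return (valid, masked) lists of patch (row, col) indices.

    depth: 2D list of floats; 0.0 marks pixels with no projected LiDAR return.
    Assumption (not from the paper): a patch is "valid" if it contains
    at least one pixel with positive depth.
    """
    rows, cols = len(depth), len(depth[0])
    valid = []
    for pr in range(rows // patch):
        for pc in range(cols // patch):
            has_depth = any(
                depth[r][c] > 0.0
                for r in range(pr * patch, (pr + 1) * patch)
                for c in range(pc * patch, (pc + 1) * patch)
            )
            if has_depth:
                valid.append((pr, pc))
    # Mask a fixed fraction of the valid patches, chosen at random.
    rng = random.Random(seed)
    n_mask = int(len(valid) * mask_ratio)
    masked = sorted(rng.sample(valid, n_mask))
    return valid, masked


# Toy 8x8 sparse depth map with two LiDAR hits.
depth = [[0.0] * 8 for _ in range(8)]
depth[1][1] = 12.5  # lands in patch (0, 0)
depth[6][5] = 30.0  # lands in patch (1, 1)
valid, masked = select_mask_patches(depth, patch=4, mask_ratio=0.5)
```

Restricting the mask to patches that have depth supervision means every masked region has a LiDAR signal to reconstruct against. Whether the paper applies exactly this criterion is an assumption of the sketch.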