VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Haiming Zhang, Wending Zhou, Yiyao Zhu, Xu Yan, Jiantao Gao, Dongfeng Bai, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen Li

TL;DR

VisionPAD addresses the challenge of pre-training vision-centric autonomous driving models without depth supervision. It introduces a 3D Gaussian Splatting decoder that renders multi-view images from voxel-based features, coupled with a self-supervised voxel velocity estimator and a photometric consistency loss to learn motion and 3D geometry from pure image data. The approach yields significant improvements in 3D object detection, semantic occupancy, and map segmentation on nuScenes, outperforming prior image-only pre-training methods while reducing computational overhead compared to NeRF-style renderers. This work establishes a scalable, depth-free paradigm for vision-based perception in autonomous driving with strong data-efficiency and practical impact.

Abstract

This paper introduces VisionPAD, a novel self-supervised pre-training paradigm designed for vision-centric algorithms in autonomous driving. In contrast to previous approaches that employ neural rendering with explicit depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to reconstruct multi-view representations using only images as supervision. Specifically, we introduce a self-supervised method for voxel velocity estimation. By warping voxels to adjacent frames and supervising the rendered outputs, the model effectively learns motion cues in the sequential data. Furthermore, we adopt a multi-frame photometric consistency approach to enhance geometric perception. It projects adjacent frames to the current frame based on rendered depths and relative poses, boosting the 3D geometric representation through pure image supervision. Extensive experiments on autonomous driving datasets demonstrate that VisionPAD significantly improves performance in 3D object detection, occupancy prediction and map segmentation, surpassing state-of-the-art pre-training strategies by a considerable margin.
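The multi-frame photometric consistency idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`reproject_pixels`, `photometric_l1`), the pinhole camera model, nearest-neighbour sampling, and the plain L1 error are all assumptions made for illustration. A rendered depth map and a relative pose carry each current-frame pixel into the adjacent frame, and the colour sampled there is compared with the current image.

```python
import numpy as np

def reproject_pixels(depth, K, T_cur_to_adj):
    """Map every current-frame pixel to adjacent-frame pixel coordinates.

    depth:        (H, W) depth of the current frame (e.g. a rendered depth map)
    K:            (3, 3) pinhole intrinsics, assumed shared by both frames
    T_cur_to_adj: (4, 4) rigid transform from current to adjacent camera frame
    """
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project pixels to 3D points in the current camera frame.
    pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    # Move the points into the adjacent camera frame.
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)
    pts_adj = (pts_h @ T_cur_to_adj.T)[:, :3]
    # Project back to the image plane.
    proj = pts_adj @ K.T
    z = proj[:, 2:3]
    uv = proj[:, :2] / np.clip(z, 1e-6, None)
    valid = z[:, 0] > 1e-6  # keep only points in front of the camera
    return uv.reshape(H, W, 2), valid.reshape(H, W)

def photometric_l1(cur_img, adj_img, depth, K, T_cur_to_adj):
    """Mean L1 error between the current image and the re-projected adjacent image."""
    H, W = depth.shape
    uv, valid = reproject_pixels(depth, K, T_cur_to_adj)
    # Nearest-neighbour sampling (an illustrative simplification).
    u = np.rint(uv[..., 0]).astype(int)
    v = np.rint(uv[..., 1]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not valid.any():
        return 0.0
    warped = adj_img[v[valid], u[valid]]
    return float(np.abs(warped - cur_img[valid]).mean())
```

In a training setting the sampling would need to be differentiable (bilinear rather than nearest-neighbour) so that gradients reach the depth; the sketch only shows the geometry of the re-projection.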

Paper Structure

This paper contains 20 sections, 12 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison with existing methods. (a) UniPAD employs volume rendering to reconstruct the multi-view depth maps and images of the current frame, using explicit depth maps for supervision. (b) In contrast, our proposed VisionPAD leverages only multi-frame, multi-view images for supervision, effectively learning motion and geometric representations through voxel velocity estimation and photometric consistency loss. "re-pro." and "sup." denote re-projection and supervision, respectively.
  • Figure 2: Overall pipeline of VisionPAD. With a vision-centric perception model as the backbone, VisionPAD takes multi-frame, multi-view images as input and generates explicit voxel representations. A 3D Gaussian Splatting (3DGS) Decoder then reconstructs multi-view images from the voxel features. Next, a velocity-guided voxel warp moves current-frame voxel features to adjacent frames, enabling self-supervised reconstruction via the 3DGS Decoder with adjacent-frame images as supervision. Finally, a photometric consistency loss, informed by relative poses for re-projection, enforces 3D geometric constraints.
  • Figure 3: Self-supervised velocity estimation. Current voxel features are warped to the adjacent frame. Subsequently, multi-view images are rendered using the 3DGS Decoder and supervised by images captured in that frame.
  • Figure 4: Data efficiency with limited data. We evaluate VisionPAD's data efficiency by reducing the proportion of available annotations used during downstream fine-tuning for 3D object detection. Results highlight the effectiveness of our pre-training.
  • Figure 5: Qualitative comparison of 3D object detection between VisionPAD (top) and UniPAD (bottom) on nuScenes val set. Each predicted object instance is illustrated by a unique colored 3D bounding box. VisionPAD demonstrably mitigates both false positive and false negative detections (highlighted within red circles).
  • ...and 2 more figures
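
The velocity-guided voxel warp mentioned in the Figure 2 and Figure 3 captions can be sketched roughly as follows. This is a hypothetical illustration, not the paper's code: the function name `warp_voxels`, the dense `(X, Y, Z, C)` feature layout, the forward-scatter strategy, and the rounding to whole voxels are all assumptions. Each voxel's features are moved by its predicted velocity times the time gap between frames.

```python
import numpy as np

def warp_voxels(feats, velocity, dt, voxel_size):
    """Shift voxel features along per-voxel velocities (forward scatter).

    feats:      (X, Y, Z, C) voxel features of the current frame
    velocity:   (X, Y, Z, 3) predicted per-voxel velocity in metres/second
    dt:         time gap to the adjacent frame in seconds
    voxel_size: edge length of one voxel in metres
    """
    X, Y, Z, _ = feats.shape
    idx = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    )
    # Displacement in whole voxels (rounded; an illustrative simplification).
    shift = np.rint(velocity * dt / voxel_size).astype(int)
    dst = idx + shift
    # Drop voxels that leave the grid.
    inside = np.all((dst >= 0) & (dst < np.array([X, Y, Z])), axis=-1)
    warped = np.zeros_like(feats)
    d = dst[inside]
    # If several voxels land on the same cell, the last write wins.
    warped[d[:, 0], d[:, 1], d[:, 2]] = feats[inside]
    return warped
```

The warped features would then be rendered and compared against the adjacent-frame images, so the rendering error is what supervises the velocity. A trainable version would need a differentiable warp (for example, interpolated sampling) instead of integer rounding.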