UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Honghui Yang, Sha Zhang, Di Huang, Xiaoyang Wu, Haoyi Zhu, Tong He, Shixiang Tang, Hengshuang Zhao, Qibo Qiu, Binbin Lin, Xiaofei He, Wanli Ouyang

TL;DR

UniPAD introduces a universal self-supervised pre-training paradigm for autonomous driving by leveraging a 3D differentiable rendering decoder to jointly learn geometry and appearance signals. By converting LiDAR and multi-view image inputs into a unified 3D voxel space and applying memory-efficient ray sampling, UniPAD learns robust representations without explicit positive/negative sample mining. The method yields substantial improvements across 3D object detection and 3D semantic segmentation on nuScenes and demonstrates strong transfer to both 2D image backbones and multi-modal detectors. This renderer-based pre-training bridges 2D and 3D domains, enabling scalable, cross-modal representation learning with practical gains in perception tasks.

Abstract

In the context of autonomous driving, the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success, most follow ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm that applies 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method allows seamless integration into both 2D and 3D frameworks, enabling a more holistic comprehension of the scenes. We demonstrate the feasibility and effectiveness of UniPAD through extensive experiments on various downstream 3D tasks. Our method significantly improves lidar-, camera-, and lidar-camera-based baselines by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic segmentation on the nuScenes validation set, achieving state-of-the-art results in comparison with previous methods. The code will be available at https://github.com/Nightmare-n/UniPAD.

Paper Structure (35 sections, 4 equations, 4 figures, 14 tables)

Figures (4)

  • Figure 1: Effect of our pre-training for 3D detection and segmentation on the nuScenes dataset (Caesar et al., 2020), where C, L, and M denote camera, LiDAR, and fusion modality, respectively.
  • Figure 2: The overall architecture. Our framework takes LiDAR point clouds or multi-view images as input. A mask generator first partially masks the input. Next, a modality-specific encoder extracts sparse visible features, which are then converted to dense features with masked regions padded as zeros. The modality-specific features are subsequently transformed into the voxel space, followed by a projection layer to enhance voxel features. Finally, volume-based neural rendering produces RGB or depth predictions for both visible and masked regions.
  • Figure 3: Illustration of the rendering results, where the ground truth RGB and projected point clouds, rendered RGB, and rendered depth are shown on the left, middle, and right, respectively.
  • Figure 4: Illustration of ray sampling strategies: i) dilation, ii) random, and iii) depth-aware sampling.
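
The three ray sampling strategies named in Figure 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `interval`, `num_rays`, and `max_depth` parameters, and the rule that depth-aware sampling keeps only pixels with a projected LiDAR depth below a threshold are all assumptions made for illustration. Each function simply returns the image pixels through which rays would be cast during rendering.

```python
import random
from typing import List, Optional, Tuple

Pixel = Tuple[int, int]  # (row, col) of the pixel a ray passes through


def dilation_sampling(height: int, width: int, interval: int) -> List[Pixel]:
    """Keep every `interval`-th pixel along both image axes (hypothetical rule)."""
    return [(r, c)
            for r in range(0, height, interval)
            for c in range(0, width, interval)]


def random_sampling(height: int, width: int, num_rays: int,
                    rng: random.Random) -> List[Pixel]:
    """Draw up to `num_rays` distinct pixels uniformly at random."""
    pixels = [(r, c) for r in range(height) for c in range(width)]
    return rng.sample(pixels, min(num_rays, len(pixels)))


def depth_aware_sampling(depth: List[List[Optional[float]]], num_rays: int,
                         max_depth: float, rng: random.Random) -> List[Pixel]:
    """Draw rays only from pixels that have a projected LiDAR depth.

    `depth[r][c]` is None where no LiDAR point projects onto the pixel.
    Restricting candidates to depths below `max_depth` is an assumption
    of this sketch, not a detail taken from the text above.
    """
    candidates = [(r, c)
                  for r, row in enumerate(depth)
                  for c, d in enumerate(row)
                  if d is not None and d < max_depth]
    return rng.sample(candidates, min(num_rays, len(candidates)))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 4x4 depth map in metres; None marks pixels without a LiDAR return.
    depth_map = [
        [None, 5.0, None, 80.0],
        [12.0, None, None, None],
        [None, None, 30.0, None],
        [None, 2.5, None, None],
    ]
    print(dilation_sampling(4, 4, 2))  # [(0, 0), (0, 2), (2, 0), (2, 2)]
    print(random_sampling(4, 4, 5, rng))
    print(depth_aware_sampling(depth_map, 3, 50.0, rng))
```

In this toy setup, dilation and random sampling ignore scene content, whereas the depth-aware variant spends its ray budget only on pixels that carry a depth signal, which is one plausible way to reduce rendering memory.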