UniWorld: Autonomous Driving Pre-training via World Models

Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai

TL;DR

UniWorld is a label-free, unified pre-training framework that learns a spatial-temporal 4D occupancy World Model from multi-view image-LiDAR pairs to improve autonomous driving perception. It adds a 4D geometric occupancy decoder to BEV-based multi-camera pipelines and trains it with a focal loss on occupancy labels derived from fused LiDAR frames, which lets the model predict future states and estimate missing ones. On nuScenes, compared with monocular pre-training, it improves motion prediction by about 1.5% IoU, multi-camera 3D object detection by 2.0% mAP and 2.0% NDS, and surrounding semantic scene completion by 3% mIoU, and it can cut 3D training annotation costs by 25%. The pre-trained World Model serves as a Foundational Model that improves data efficiency and cross-task transfer for downstream driving tasks.

Abstract

In this paper, we draw inspiration from Alberto Elfes' pioneering work in 1989, where he introduced the concept of the occupancy grid as World Models for robots. We imbue the robot with a spatial-temporal world model, termed UniWorld, to perceive its surroundings and predict the future behavior of other participants. UniWorld first predicts 4D geometric occupancy as the World Model in a foundational pre-training stage and is subsequently fine-tuned on downstream tasks. UniWorld can estimate missing information about the world state and predict plausible future states of the world. Moreover, UniWorld's pre-training process is label-free, which makes it possible to use massive amounts of image-LiDAR pairs to build a Foundational Model. The proposed unified pre-training framework demonstrates promising results in key tasks such as motion prediction, multi-camera 3D object detection, and surrounding semantic scene completion. Compared to monocular pre-training methods on the nuScenes dataset, UniWorld improves IoU for motion prediction by about 1.5%, mAP and NDS for multi-camera 3D object detection by 2.0% each, and mIoU for surrounding semantic scene completion by 3%. With our unified pre-training method, a 25% reduction in 3D training annotation costs can be achieved, offering significant practical value for real-world autonomous driving. Code is publicly available at https://github.com/chaytonmin/UniWorld.

Paper Structure (24 sections, 1 equation, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Comparison between monocular pre-training and our unified multi-camera pre-training. Monocular pre-training only improves feature extraction from a single view, whereas our proposed multi-view unified pre-training incorporates temporal and spatial information from multi-view images through World Models.
  • Figure 2: The overall architecture of the proposed multi-camera unified pre-training method UniWorld. We first transform the multi-frame, large-scale, irregular LiDAR point clouds into volumetric representations that serve as the 4D geometric occupancy labels, then add an occupancy decoder consisting of a few 3D convolution layers to the BEV encoder. The pretext task is binary occupancy classification: predicting whether each 4D voxel contains points. After pre-training, the lightweight decoder is discarded, and the encoder is used to warm up the backbones of downstream tasks.
  • Figure 3: Performance curves.
  • Figure 4: Label-efficiency.
  • Figure 5: Visualization of scene reconstruction via occupancy prediction.
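
The label generation and pretext task described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that); it only shows one plausible way to turn per-time-step LiDAR points into binary 4D occupancy labels and to score predicted occupancy probabilities with a binary focal loss. The function names, the assumption that all points are already expressed in a common ego frame, and the defaults alpha=0.25 and gamma=2.0 (the values commonly used in the focal loss literature, not values taken from this paper) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int, int]  # (t, ix, iy, iz)


def grid_shape(pc_range: Sequence[float], voxel_size: Sequence[float]) -> Tuple[int, ...]:
    """Number of voxels along x, y, z; pc_range is (x_min, y_min, z_min, x_max, y_max, z_max)."""
    return tuple(
        int(round((pc_range[i + 3] - pc_range[i]) / voxel_size[i])) for i in range(3)
    )


def occupied_voxels(
    frames: Sequence[Sequence[Point]],
    pc_range: Sequence[float],
    voxel_size: Sequence[float],
) -> Set[Voxel]:
    """Return the (t, ix, iy, iz) indices of voxels containing at least one point.

    Assumption: frames[t] holds the (possibly fused) points for time step t,
    already transformed into one shared coordinate frame.
    """
    lo, hi = pc_range[:3], pc_range[3:]
    shape = grid_shape(pc_range, voxel_size)
    occupied: Set[Voxel] = set()
    for t, points in enumerate(frames):
        for p in points:
            # Skip points outside the region of interest.
            if not all(lo[i] <= p[i] < hi[i] for i in range(3)):
                continue
            # Clamp to guard against floating-point rounding at the upper edge.
            idx = tuple(
                min(int((p[i] - lo[i]) // voxel_size[i]), shape[i] - 1) for i in range(3)
            )
            occupied.add((t, *idx))
    return occupied


def dense_labels(occupied: Set[Voxel], num_frames: int, shape: Sequence[int]) -> List[int]:
    """Flatten the sparse voxel set into 0/1 labels ordered by (t, ix, iy, iz)."""
    nx, ny, nz = shape
    return [
        1 if (t, ix, iy, iz) in occupied else 0
        for t in range(num_frames)
        for ix in range(nx)
        for iy in range(ny)
        for iz in range(nz)
    ]


def binary_focal_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    alpha: float = 0.25,  # assumed default, not taken from the paper
    gamma: float = 2.0,   # assumed default, not taken from the paper
    eps: float = 1e-7,
) -> float:
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # Toy data: two time steps inside a 2 m x 2 m x 2 m region with 1 m voxels.
    toy_frames = [[(0.5, 0.5, 0.5), (0.6, 0.4, 0.2)], [(1.5, 0.5, 0.5)]]
    toy_range = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    toy_voxel = (1.0, 1.0, 1.0)
    occ = occupied_voxels(toy_frames, toy_range, toy_voxel)
    y = dense_labels(occ, len(toy_frames), grid_shape(toy_range, toy_voxel))
    uniform_guess = [0.5] * len(y)  # stand-in for decoder outputs
    print(binary_focal_loss(uniform_guess, y))
```

Most voxels in a driving scene are empty, so the labels are heavily imbalanced; the focal term (1 - p_t)^gamma down-weights voxels that are already classified confidently, which is the usual motivation for preferring it over plain cross-entropy in this setting. In a real pipeline the labels would be dense tensors and the probabilities would come from the occupancy decoder rather than a constant guess.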