Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, Yilun Chen

TL;DR

The paper tackles the high cost of 3D occupancy labeling for autonomous driving and the need to forecast future scenes from vision. It introduces PreWorld, a semi-supervised vision-centric 3D occupancy world model, featuring a two-stage training paradigm that leverages abundant 2D labels for self-supervised pre-training and 3D occupancy labels for fine-tuning, enabling end-to-end forecasting from image inputs. A simple state-conditioned forecasting module allows joint optimization with the occupancy network, reducing information loss typical of token-based forecasters. Empirical results on Occ3D-nuScenes show competitive 3D occupancy prediction, state-of-the-art 4D occupancy forecasting, and strong motion planning, validating the approach and its scalability for large-scale training in autonomous driving scenarios.

Abstract

Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observation. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels through a novel two-stage training paradigm: the self-supervised pre-training stage and the fully-supervised fine-tuning stage. Specifically, during the pre-training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantic), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state-conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.
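The abstract states that the projected attribute fields (RGB, density, semantic) are supervised by 2D labels through volume rendering, without spelling out the rendering formula. As a rough illustration, the sketch below composites per-sample attributes along a single camera ray using the standard NeRF-style quadrature. The function name `render_ray`, the per-ray sampling layout, and the choice of this particular quadrature are assumptions made for illustration; they are not taken from the paper.

```python
import math


def render_ray(densities, deltas, attributes):
    """Composite per-sample attributes along one ray.

    Assumed NeRF-style quadrature (not confirmed by the source):
      alpha_i  = 1 - exp(-sigma_i * delta_i)
      weight_i = T_i * alpha_i, with T_i = prod_{j<i} (1 - alpha_j)
    `attributes` holds one vector per sample (e.g. RGB or semantic scores).
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    rendered = [0.0] * len(attributes[0])
    weights = []
    for sigma, delta, attr in zip(densities, deltas, attributes):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for k, value in enumerate(attr):
            rendered[k] += weight * value
        transmittance *= 1.0 - alpha  # attenuate for the samples behind
    return rendered, weights
```

The rendered vector for each pixel could then be compared against the corresponding 2D label (image colour or 2D semantic map), which is how 2D annotations can supervise a 3D field. An opaque sample receives nearly all of the weight and hides everything behind it.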

Paper Structure

This paper contains 37 sections, 10 equations, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: (a) Self-Supervised 3D Occupancy Model can be trained using solely 2D labels as supervision. However, it lacks the capability to forecast future occupancy. In contrast, (b) Fully-Supervised 3D Occupancy World Model can forecast future occupancy, but it relies on 3D occupancy labels for meaningful results due to its indirect architecture, which employs a frozen 3D occupancy model. To tackle these challenges, our (c) Semi-Supervised 3D Occupancy World Model, featuring 2D rendering supervision and an end-to-end architecture, can forecast future occupancy directly from image inputs while taking advantage of 2D labels.
  • Figure 2: The architecture of our proposed PreWorld. First, volume features are extracted from multi-view images with an occupancy network. Subsequently, a state-conditioned forecasting module is employed to recursively forecast future volume features using historical features. In the self-supervised pre-training stage, volume features are projected into various attribute fields and supervised by 2D labels through volume rendering techniques. In the fully-supervised fine-tuning stage, the attribute projection head no longer participates in the computation; occupancy predictions are obtained directly via an occupancy head and supervised by 3D occupancy labels.
  • Figure 3: The proposed state-conditioned forecasting module is simply composed of two MLPs. Ego states can be optionally integrated into the network, as denoted by the dashed arrows.
  • Figure 4: Qualitative results of 3D occupancy prediction on the Occ3D-nuScenes validation set. The holistic structure and fine-grained details of the scene are highlighted by orange boxes and red boxes respectively. Compared with existing fully-supervised methods and self-supervised methods, PreWorld can obtain better scene structure and capture finer local details.
  • Figure 5: More qualitative results of 3D occupancy prediction on the Occ3D-nuScenes validation set. The holistic structure and fine-grained details of the scene are highlighted by orange boxes and red boxes respectively.
  • ...and 1 more figures
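
The Figure 3 caption describes the state-conditioned forecasting module as two MLPs with optional ego-state input, and the abstract says it forecasts future occupancy and ego trajectory recursively. A minimal sketch of that idea follows. The split of roles (one MLP for the next volume feature, one for the next ego state), the flat feature vectors, the layer sizes, and every function name here are hypothetical and chosen only for illustration; the source does not give these details.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Build a two-layer MLP as a list of (weights, biases) pairs."""
    def layer(n_in, n_out):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out
    return [layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)]


def apply_mlp(mlp, x):
    """Forward pass: linear -> ReLU -> linear."""
    for index, (weights, biases) in enumerate(mlp):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(mlp) - 1:
            x = [max(0.0, v) for v in x]
    return x


def forecast(feature, ego_state, feature_mlp, ego_mlp, steps):
    """Roll forward `steps` times, feeding each prediction back as input."""
    features, ego_states = [], []
    for _ in range(steps):
        joint = feature + ego_state  # condition on the ego state by concatenation
        feature = apply_mlp(feature_mlp, joint)   # assumed: next volume feature
        ego_state = apply_mlp(ego_mlp, joint)     # assumed: next ego state
        features.append(feature)
        ego_states.append(ego_state)
    return features, ego_states


# Hypothetical sizes: an 8-dim feature and a 3-dim ego state.
rng = random.Random(0)
feature_mlp = make_mlp(8 + 3, 16, 8, rng)
ego_mlp = make_mlp(8 + 3, 16, 3, rng)
future_features, future_ego = forecast([0.5] * 8, [0.0] * 3,
                                       feature_mlp, ego_mlp, steps=3)
```

Because the forecaster operates directly on continuous features, it is differentiable end to end and can be optimized jointly with the occupancy network, which is the property the TL;DR contrasts with token-based forecasters.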