OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu

TL;DR

OccSora introduces a diffusion-based 4D occupancy world model for autonomous driving that generates long-horizon, trajectory-controlled 4D scenes without relying on 3D bounding boxes or maps. A 4D occupancy scene tokenizer compresses real scenes into discrete tokens, and a diffusion transformer then generates 4D occupancy in that token space, conditioned on an ego trajectory. Trained on nuScenes with Occ3D occupancy annotations, the model generates 16-second occupancy videos with a consistent 3D layout and temporal coherence. With trajectory-aware generation, it has the potential to serve as a world simulator for planning and decision-making.

Abstract

Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: https://github.com/wzzheng/OccSora.
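
The "compact discrete spatial-temporal representations" mentioned in the abstract can be illustrated with a generic vector-quantization lookup, in which each continuous feature vector is replaced by the index of its nearest codebook entry. The following is a minimal NumPy sketch under stated assumptions, not the OccSora implementation: the function name `quantize`, the codebook size, and the feature dimension are all hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (N, D) continuous encoder outputs (hypothetical shapes).
    codebook: (K, D) code vectors (hypothetical; learned in a real tokenizer).
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    diff = features[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # The discrete token is the index of the closest code.
    indices = np.argmin(dist, axis=1)
    # Decoding starts from the code vectors selected by those indices.
    quantized = codebook[indices]
    return indices, quantized

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 codes of dimension 4 (illustrative)
# Features that sit very close to codes 3, 0, and 5.
features = codebook[[3, 0, 5]] + 0.001 * rng.normal(size=(3, 4))
tokens, recon = quantize(features, codebook)
print(tokens)
```

In this toy setup the token sequence is what a downstream generative model would operate on, and `recon` stands in for the input a decoder would receive.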

Paper Structure

This paper contains 14 sections, 4 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Comparisons with existing methods. OccSora can comprehend the intricate relationship between scenes and trajectories and generate long-term, physically consistent 4D occupancy.
  • Figure 2: The pipeline of OccSora. The 4D occupancy scene tokenizer achieves compression and restoration of real information. The compressed information and vehicle trajectories are simultaneously used as inputs for the diffusion-based world model. After training, the diffusion-based world model utilizes random noise and arbitrary trajectories to generate controllable tokens, which are then decoded into 4D occupancy maps in the 4D occupancy scene tokenizer stage.
  • Figure 3: The structure of the 4D occupancy scene tokenizer. The proposed method encodes and compresses 4D scenes to extract high-dimensional features, which are then decoded to retrieve the spatiotemporal physical characteristics of the scenes.
  • Figure 4: The structure of the diffusion-based world model. The model uses the optimal codebook obtained from training the 4D occupancy scene tokenizer to convert 4D occupancy into a sequence of tokens. These tokens, together with the ego vehicle trajectory and random noise, form the input for denoising training, which yields the generated tokens.
  • Figure 5: Visualization of reconstruction of the 4D occupancy scene tokenizer.
  • ...and 6 more figures
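
The denoising training described in the Figure 4 caption combines scene tokens with random noise. A standard way to build such a noisy training input is the DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below assumes that formulation with a linear noise schedule; the captions above do not state which schedule or parameterization OccSora uses, and the names `make_alpha_bar` and `add_noise`, the step count, and the tensor shapes are hypothetical. The trajectory-conditioned denoising network itself is omitted.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t from the clean input x0 at diffusion step t.

    Returns the noisy sample and the noise that was added; in noise-prediction
    training, the latter is the regression target for the denoiser.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((16, 8))  # hypothetical token embeddings
x_t, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

In a conditional setup, a denoiser would receive `x_t`, the step index, and an embedding of the ego trajectory, and would be trained to recover `eps`; at generation time the process starts from pure noise and is steered by the chosen trajectory.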