OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, Jiwen Lu
TL;DR
OccSora is a diffusion-based 4D occupancy world model for autonomous driving that enables long-range, trajectory-controlled generation of 4D scenes without relying on 3D bounding boxes or maps. A 4D occupancy scene tokenizer compresses occupancy sequences into compact discrete spatial-temporal tokens; a diffusion transformer trained on these tokens then generates 4D occupancy conditioned on an ego-trajectory prompt, with the aim of keeping the generated scenes spatially and temporally consistent. Evaluated on nuScenes with Occ3D occupancy annotations, the approach generates 16-second occupancy videos with authentic 3D layout and temporal consistency. The resulting controllable, long-horizon 4D occupancy generator has the potential to serve as a world simulator for planning and decision-making in autonomous driving.
Abstract
Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffers from inefficiency in modeling long-term temporal evolution. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: https://github.com/wzzheng/OccSora.
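To make the phrase "compact discrete spatial-temporal representations" more concrete, the sketch below shows one common way to turn continuous latent features into discrete tokens: a nearest-neighbour codebook lookup. This is an illustrative assumption, not the OccSora implementation; the abstract does not specify the quantization scheme. The function name `quantize`, the random codebook, and the sizes `T`, `H`, `W`, `D`, `K` are all hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (N, D) continuous features (assumed to come from some encoder)
    codebook: (K, D) code vectors (learned in practice; random here)
    Returns (tokens, quantized): indices of shape (N,) and vectors of shape (N, D).
    """
    # Squared Euclidean distance between every latent and every code: (N, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)       # discrete token ids
    return tokens, codebook[tokens]  # ids and their code vectors

rng = np.random.default_rng(0)

# Hypothetical sizes: T time steps, an H x W latent grid, D channels, K codes.
T, H, W, D, K = 4, 5, 5, 8, 16
codebook = rng.normal(size=(K, D))
latents = rng.normal(size=(T, H, W, D))  # stand-in for encoder output

# Flatten the spatial-temporal grid, quantize, and restore the grid shape.
tokens, quantized = quantize(latents.reshape(-1, D), codebook)
tokens = tokens.reshape(T, H, W)
quantized = quantized.reshape(T, H, W, D)
```

Under these assumptions, `tokens` is an integer grid over time and space that a downstream generative model could operate on, and `quantized` is the corresponding continuous approximation that a decoder would map back to occupancy.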
