BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents

Yumeng Zhang; Shi Gong; Kaixin Xiong; Xiaoqing Ye; Xiaofan Li; Xiao Tan; Fan Wang; Jizhou Huang; Hua Wu; Haifeng Wang

TL;DR

BEVWorld tackles the challenge of unified, multimodal world modeling for autonomous driving by mapping heterogeneous sensor data into a compact BEV latent space. It decouples a self-supervised multi-modal tokenizer from a latent BEV sequence diffusion model, enabling non-autoregressive, action-conditioned future prediction with ray-based rendering for high-quality reconstructions. The framework delivers strong performance in multi-modal future generation and demonstrates tangible gains for downstream perception and motion-prediction tasks on nuScenes and CARLA, while highlighting the benefits of pretraining the tokenizer. This work advances scene-level reasoning in 3D space and provides a scalable foundation for simulation, data augmentation, and pretraining in autonomous driving research.

Abstract

World models have attracted increasing attention in autonomous driving for their ability to forecast potential future scenarios. In this paper, we propose BEVWorld, a novel framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model. The multi-modal tokenizer first encodes heterogeneous sensory data, and its decoder reconstructs the latent BEV tokens into LiDAR and surround-view image observations via ray-casting rendering in a self-supervised manner. This enables joint modeling and bidirectional encoding-decoding of panoramic imagery and point cloud data within a shared spatial representation. On top of this, the latent BEV sequence diffusion model performs temporally consistent forecasting of future scenes, conditioned on high-level action tokens, enabling scene-level reasoning over time. Extensive experiments demonstrate the effectiveness of BEVWorld on autonomous driving benchmarks, showcasing its capability in realistic future scene generation and its benefits for downstream tasks such as perception and motion prediction.

Paper Structure (23 sections, 7 equations, 12 figures, 5 tables)

Introduction
Related Works
World Model
Video Diffusion Model
Method
Multi-Modal Tokenizer
Latent BEV Sequence Diffusion
Experiments
Dataset
Multi-modal Tokenizer
Ablation Studies
Benefit for Downstream Tasks
Latent BEV Sequence Diffusion
Training Details
LiDAR Prediction Quality
...and 8 more sections

Figures (12)

Figure 1: An overview of our method, BEVWorld. BEVWorld consists of the multi-modal tokenizer and the latent BEV sequence diffusion model. The tokenizer first encodes the image and LiDAR observations into BEV tokens, then decodes the unified BEV tokens into reconstructed observations using NeRF rendering strategies. The latent BEV sequence diffusion model predicts future BEV tokens, conditioned on the corresponding actions, with a Spatial-Temporal Transformer. The multi-frame future BEV tokens are obtained in a single inference pass, avoiding the cumulative errors of autoregressive methods.
Figure 2: The detailed structure of the BEV encoder. The encoder takes multi-view, multi-modality sensor data as input. Multimodal information is fused using deformable attention, and the BEV features are channel-compressed to be compatible with the diffusion model.
Figure 3: Left: Details of multi-view image rendering. Trilinear interpolation is applied to the series of sampled points along each ray to obtain the weights $w_i$ and features $\mathbf{v}_i$. The features $\{\mathbf{v}_i\}$ are weighted by $\{w_i\}$ and summed to get the rendered image features, which are concatenated and fed into the decoder for $8\times$ upsampling, resulting in multi-view RGB images. Right: Details of LiDAR rendering. Trilinear interpolation is also applied to obtain the weights $w_i$ and depths $t_i$. The depths $\{t_i\}$ are weighted by $\{w_i\}$ and summed to get the final depth of the point. The point is then transformed from the spherical coordinate system to the Cartesian coordinate system to obtain the LiDAR point coordinates.
Figure 4: The architecture of the Spatial-Temporal Transformer block.
Figure 5: Visualization of LiDAR and video predictions.
...and 7 more figures
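
The LiDAR rendering step described in the Figure 3 caption (weight the per-sample depths $t_i$ by $w_i$, sum them, then convert the point from spherical to Cartesian coordinates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `render_depth` and `spherical_to_cartesian` are hypothetical, the weights are assumed to be already computed and to sum to roughly 1, and the azimuth/elevation angle convention is an assumption that the caption does not specify.

```python
import math


def render_depth(weights, depths):
    """Return the ray depth as the weighted sum of per-sample depths t_i.

    Assumes weights w_i were already obtained for the sampled points
    (how they are computed is outside the scope of this sketch).
    """
    if len(weights) != len(depths):
        raise ValueError("weights and depths must have the same length")
    return sum(w * t for w, t in zip(weights, depths))


def spherical_to_cartesian(depth, azimuth, elevation):
    """Convert (depth, azimuth, elevation) to (x, y, z).

    Assumed convention: azimuth is measured in the x-y plane from the
    x axis, elevation is measured upward from the x-y plane, both in radians.
    """
    x = depth * math.cos(elevation) * math.cos(azimuth)
    y = depth * math.cos(elevation) * math.sin(azimuth)
    z = depth * math.sin(elevation)
    return (x, y, z)


# Example: three samples along one ray pointing along the x axis.
example_depth = render_depth([0.1, 0.7, 0.2], [5.0, 10.0, 15.0])
example_point = spherical_to_cartesian(example_depth, azimuth=0.0, elevation=0.0)
```

Because the depth is an expectation over the samples, a ray whose weights concentrate on one sample lands close to that sample's depth, while diffuse weights pull the point toward the mean of the sampled depths.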
