DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Chen Shi; Shaoshuai Shi; Kehua Sheng; Bo Zhang; Li Jiang

TL;DR

DriveX tackles the challenge of out-of-distribution generalization in autonomous driving by learning a self-supervised world model that captures general scene dynamics in a latent BEV space. It introduces Omni Scene Modeling to fuse 3D geometry, 2D semantics, and image texture through multimodal self-supervision, and adopts a decoupled latent learning approach with flow-based future forecasting and dynamic-aware ray sampling to model motion. The Future Spatial Attention mechanism then integrates the predicted future BEV features with downstream tasks, achieving improvements across 3D point-cloud forecasting, occupancy prediction, occupancy flow estimation, and end-to-end driving. Empirical results on nuScenes and NAVSIM demonstrate state-of-the-art or strong performance gains with low overhead, highlighting DriveX as a practical, general-purpose world model for robust autonomous driving.

Abstract

Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
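
The abstract names dynamic-aware ray sampling but does not spell out the sampling rule. As a rough illustration of the general idea only, the sketch below draws supervision rays with a higher probability when they hit moving regions. The function name, the per-ray `flow_magnitude` input, the threshold, and the 4x weighting are all assumptions made for this example and are not taken from DriveX.

```python
import numpy as np

def dynamic_aware_ray_sampling(flow_magnitude, num_rays,
                               dynamic_threshold=0.5, dynamic_weight=4.0,
                               rng=None):
    """Sample ray indices without replacement, favouring dynamic rays.

    flow_magnitude: (N,) per-ray motion magnitude (hypothetical input).
    dynamic_threshold, dynamic_weight: illustrative values, not from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    flow_magnitude = np.asarray(flow_magnitude, dtype=np.float64)

    # Rays whose motion exceeds the threshold are treated as dynamic.
    is_dynamic = flow_magnitude > dynamic_threshold

    # Dynamic rays get a larger unnormalised weight than static ones.
    weights = np.where(is_dynamic, dynamic_weight, 1.0)
    probs = weights / weights.sum()

    # Draw distinct ray indices according to the weighted distribution.
    return rng.choice(flow_magnitude.size, size=num_rays,
                      replace=False, p=probs)


# Usage: 1000 rays, of which the first 100 are moving.
flow = np.zeros(1000)
flow[:100] = 1.0
sampled = dynamic_aware_ray_sampling(flow, 200, rng=np.random.default_rng(0))
dynamic_fraction = float((sampled < 100).mean())
```

In this toy setup only 10% of the rays are dynamic, so a uniform sampler would pick them about 10% of the time; the weighted sampler picks them noticeably more often, which is the effect such a scheme aims for when moving objects occupy a small share of the scene.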
