DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Chen Shi, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

TL;DR

DriveX tackles the challenge of out-of-distribution generalization in autonomous driving by learning a self-supervised world model that captures general scene dynamics in a latent BEV space. It introduces Omni Scene Modeling to fuse 3D geometry, 2D semantics, and image texture through multimodal self-supervision, and adopts a decoupled latent learning approach with flow-based future forecasting and dynamic-aware ray sampling to model motion. The Future Spatial Attention mechanism then integrates the predicted future BEV features with downstream tasks, achieving improvements across 3D point-cloud forecasting, occupancy prediction, occupancy flow estimation, and end-to-end driving. Empirical results on nuScenes and NAVSIM demonstrate state-of-the-art or strong performance gains with low overhead, highlighting DriveX as a practical, general-purpose world model for robust autonomous driving.

Abstract

Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.

Paper Structure

This paper contains 12 sections, 10 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Comparison of different methods for integrating world models in autonomous driving systems. (a) World modeling serves as a pretraining task or an auxiliary supervision for driving models, without fully capturing general world knowledge. (b) Future prediction is performed in video space to assist downstream tasks, requiring additional 2D-to-3D transformation. (c) Our proposed DriveX framework decodes future states in latent 3D space, learning task-agnostic features that seamlessly benefit driving tasks.
  • Figure 2: Top: An overview of our DriveX framework. The learning process consists of two stages: world representation learning, where the model learns temporal and geometric semantics through Omni Scene Modeling, and latent future decoding, where the model predicts future states in the learned latent space. Both stages are trained in a self-supervised manner. Bottom: Illustration of the FSA paradigm. Various driving tasks can dynamically aggregate information from predicted latent features through FSA.
  • Figure 3: Illustration of the flow-based future forecasting pipeline. (a) The detailed structure of the future decoder, consisting of a future propagation module, a motion head "M", and a refinement 2D convolutional backbone. (b) Grid points with predicted flows are transformed to their coordinates at time $t+k$, followed by distance-based interpolation to obtain the future grid features.
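
The propagation step described in Figure 3(b) can be illustrated with a small forward-warping sketch. This is not the authors' implementation: the function name `warp_bev_features`, the nested-list grid layout, the flow units (grid cells), and the choice of bilinear weights as the "distance-based" interpolation are all assumptions made for illustration. Each grid point is moved by its predicted flow, and its feature is then distributed over the four surrounding cells of the future grid, weighted by proximity.

```python
import math


def warp_bev_features(features, flows):
    """Forward-warp a BEV feature grid using per-cell flow (hypothetical helper).

    features: H x W x C nested lists of floats (BEV features at time t).
    flows:    H x W nested lists of (dx, dy) displacements in grid-cell units
              (assumed to be the predicted motion from t to t+k).
    Returns an H x W x C grid; cells that receive no moved point stay zero.
    """
    height, width = len(features), len(features[0])
    channels = len(features[0][0])
    accum = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    weight_sum = [[0.0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            dx, dy = flows[y][x]
            tx, ty = x + dx, y + dy  # location of this grid point at time t+k
            x0, y0 = math.floor(tx), math.floor(ty)
            # Distribute the feature over the four cells around (tx, ty).
            for ny in (y0, y0 + 1):
                for nx in (x0, x0 + 1):
                    if not (0 <= nx < width and 0 <= ny < height):
                        continue  # the point moved outside the grid
                    # Assumed weighting: bilinear, larger when the cell is closer.
                    w = (1.0 - abs(tx - nx)) * (1.0 - abs(ty - ny))
                    if w <= 0.0:
                        continue
                    weight_sum[ny][nx] += w
                    for c in range(channels):
                        accum[ny][nx][c] += w * features[y][x][c]

    # Normalise each future cell by the total weight it received.
    for y in range(height):
        for x in range(width):
            if weight_sum[y][x] > 0.0:
                accum[y][x] = [v / weight_sum[y][x] for v in accum[y][x]]
    return accum


# Toy example: a 2 x 3 grid with one channel, every cell moving one cell to the right.
features = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]]]
flows = [[(1.0, 0.0)] * 3 for _ in range(2)]
future = warp_bev_features(features, flows)
# future == [[[0.0], [1.0], [2.0]], [[0.0], [4.0], [5.0]]]
```

In this toy case the leftmost column of the future grid receives no points and stays zero. Such holes are one plausible reason for following the propagation step with a refinement backbone, as Figure 3(a) shows, although the caption does not state the motivation.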