Physical Informed Driving World Model

Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu

TL;DR

DrivePhysica introduces three physically grounded modules (Coordinate System Aligner, Instance Flow Guidance, and Box Coordinate Guidance) that respectively enforce a correct understanding of relative versus absolute motion, temporal stability, and occlusion-aware spatial relationships in multi-view driving video generation. Built on a diffusion-transformer backbone with three-level control conditions, it achieves state-of-the-art video quality (FVD 38.06, FID 3.96) on nuScenes and strong downstream perception metrics, while enabling editing and CARLA-driven long-tail scenario synthesis. The work demonstrates that incorporating physical principles into driving world models yields more realistic synthetic data and more effective perception training, reducing reliance on labeled real data. It also provides a framework for controllable, physically consistent video generation that can be extended to higher resolutions and more efficient inference.

Abstract

Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks like 3D object detection, segmentation, and trajectory prediction. While world models provide a cost-effective solution for generating realistic driving videos, challenges remain in ensuring these videos adhere to fundamental physical principles, such as relative and absolute motion, spatial relationships such as occlusion and spatial consistency, and temporal consistency. To address these, we propose DrivePhysica, an innovative model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in physical principles, we achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and downstream perception tasks. Our project homepage: https://metadrivescape.github.io/papers_project/DrivePhysica/page.html

Paper Structure

This paper contains 29 sections, 13 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Limitations of previous works in modeling physical laws within driving scenarios. (a) Incorrect understanding of relative motion: In reality, the parked black car and the parked white car should exhibit slight movement relative to the ego vehicle. However, the black car remains stationary. (b) The color of the vehicle changes over time. (c) Incorrect understanding of the occlusion hierarchy: The box condition in the background is incorrectly generated in the foreground. (d) The appearance of the same car across two views is inconsistent.
  • Figure 2: Overview of DrivePhysica. (a) The Coordinate System Aligner module, which uses camera pose parameters to align the vehicle coordinate system with the world coordinate system. (b) The Instance Flow Guidance module, which uses instance flow to improve temporal consistency. (c) The Box Coordinate Guidance module, which encodes the box coordinates to provide spatial relation information. (d) The Controlled Denoising Process, enabled by ST-DiT with ControlNet for unified condition control.
  • Figure 3:
  • Figure 4: Relative Motion Understanding.
  • Figure 5: Temporal Consistency.
  • ...and 8 more figures
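
The Figure 2(a) caption describes aligning the vehicle coordinate system with the world coordinate system, and Figure 1(a) describes a parked car that should appear to move relative to the ego vehicle. The following is a minimal geometric sketch of that distinction only, not the paper's learned module: the helper names (`make_pose`, `ego_to_world`, `world_to_ego`) and the 4x4 ego-to-world pose convention are assumptions introduced for illustration.

```python
import numpy as np

def make_pose(x: float, y: float, yaw: float) -> np.ndarray:
    """Hypothetical helper: 4x4 ego-to-world transform for a planar pose."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:2, :2] = [[c, -s], [s, c]]  # rotation about the z axis
    pose[:2, 3] = [x, y]              # translation in the ground plane
    return pose

def ego_to_world(points_ego: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Map (N, 3) ego-frame points to the world frame."""
    homo = np.hstack([points_ego, np.ones((points_ego.shape[0], 1))])
    return (pose @ homo.T).T[:, :3]

def world_to_ego(points_world: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Map (N, 3) world-frame points to the ego frame."""
    return ego_to_world(points_world, np.linalg.inv(pose))

# A parked car has a fixed world position; the ego vehicle drives forward.
parked_world = np.array([[10.0, 2.0, 0.0]])
ego_poses = [make_pose(float(t), 0.0, 0.0) for t in range(3)]

# Relative motion: the parked car's ego-frame position changes every frame.
parked_ego = [world_to_ego(parked_world, pose) for pose in ego_poses]

# Absolute motion: mapping back to the world frame shows it never moved.
recovered_world = [ego_to_world(p, pose) for p, pose in zip(parked_ego, ego_poses)]
```

In this toy setup the ego-frame x coordinate of the parked car shrinks as the ego vehicle advances, while the recovered world position stays constant. A generator that sees only ego-frame boxes has no direct signal telling it which of the two kinds of motion it is looking at.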