Toward Physically Consistent Driving Video World Models under Challenging Trajectories

Jiawei Zhou, Zhenxin Zhu, Lingyi Du, Linye Lyu, Lijun Zhou, Zhanqian Wu, Hongcheng Luo, Zhuotao Tian, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Yu Li

Abstract

Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories, such as imperfect trajectories generated by simulators or planning systems, producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: https://wm-research.github.io/PhyGenesis/.
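To make the two-stage design concrete, the sketch below illustrates the kind of interface the abstract describes: a rectification step that turns a possibly physics-violating 2D trajectory into a dynamically feasible one, followed by a conditional video generator. Everything here is a hypothetical illustration; the function names, the per-axis acceleration-clamping rule, and the numeric bounds are our own assumptions, not the authors' implementation (which rectifies trajectories into full 6-DoF motions and conditions on projected camera-view layouts).

```python
import numpy as np

def rectify_trajectory(xy, dt=0.5, a_max=8.0):
    """Hypothetical stand-in for the physical condition generator:
    clamp the per-axis acceleration implied by a 2D waypoint sequence
    so that the re-integrated motion respects a feasibility bound."""
    xy = np.asarray(xy, dtype=float)
    out = xy.copy()
    v = (out[1] - out[0]) / dt                 # initial velocity estimate
    for t in range(1, len(out) - 1):
        v_des = (xy[t + 1] - out[t]) / dt      # velocity needed to reach the next raw waypoint
        a = np.clip((v_des - v) / dt, -a_max, a_max)
        v = v + a * dt
        out[t + 1] = out[t] + v * dt           # re-integrate a feasible position
    return out

def generate_video(rectified_trajectory, scene_context):
    """Placeholder for the physics-enhanced video generator; the real model
    conditions on camera-view layout projections of the rectified motion."""
    raise NotImplementedError

# A counterfactual trajectory with an abrupt, implausible jump gets smoothed out.
raw = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [20.0, 0.0], [21.0, 0.0]]
print(rectify_trajectory(raw))
```

In the full system, the rectified 6-DoF motion is additionally projected into per-camera layout conditions before being passed to the multi-view generator.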

Paper Structure

This paper contains 19 sections, 13 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Qualitative comparison of video generation under diverse trajectory conditions (front view of the multi-view outputs is shown). Prior methods (e.g., DiST-4D) exhibit artifacts and geometric distortions under physically challenging trajectories, whereas PhyGenesis preserves physical consistency and high visual fidelity. Additional videos are provided in the supplementary material.
  • Figure 2: Overview of PhyGenesis. (a) Our heterogeneous multi-view dataset consists of both real-world driving data and simulated data that emphasizes physically challenging scenarios, including ego-vehicle collisions and roadway departures, among others. (b) In PhyGenesis, the physical condition generator first rectifies arbitrary 2D trajectories—potentially counterfactual or physics-violating—into physically plausible 6-DoF motions. The rectified trajectories are then projected into camera-view layout conditions and fed into a physics-enhanced video generator, co-trained on the heterogeneous dataset, to synthesize high-fidelity, physically consistent multi-view videos.
  • Figure 3: Distributions of maximum ego-vehicle acceleration for nuScenes, CARLA Ego, and CARLA ADV. The simulated CARLA datasets show a clear shift toward higher accelerations, indicating more aggressive dynamics and physically challenging events compared with the predominantly nominal driving behaviors in nuScenes (a sketch of how such a statistic can be computed follows this list).
  • Figure 4: Comparison of MLP and time-wise output head in simulating collision dynamics. The MLP shows a gradual velocity decrease after the collision, while the GT and time-wise head show an instantaneous drop to zero, producing more realistic dynamics (see the head-comparison sketch after this list).
  • Figure 5: Qualitative comparison with different baselines. Our method maintains the best physical consistency and produces the highest visual quality.
  • ...and 8 more figures
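Figure 3's statistic, the maximum ego-vehicle acceleration per clip, is straightforward to reproduce. The exact computation is not specified in this excerpt, so the sketch below assumes finite-difference accelerations over 2D waypoints sampled at a hypothetical 2 Hz (dt = 0.5 s).

```python
import numpy as np

def max_ego_acceleration(xy, dt=0.5):
    """Maximum acceleration magnitude of an ego trajectory given as
    2D waypoints sampled every `dt` seconds (finite differences)."""
    xy = np.asarray(xy, dtype=float)
    vel = np.diff(xy, axis=0) / dt     # per-step velocity (m/s)
    acc = np.diff(vel, axis=0) / dt    # per-step acceleration (m/s^2)
    return float(np.linalg.norm(acc, axis=1).max())

# A hard stop: roughly 10 m/s to 0 m/s within a single 0.5 s step.
traj = [[0, 0], [5, 0], [10, 0], [10, 0], [10, 0]]
print(max_ego_acceleration(traj))  # -> 20.0 m/s^2
```

A stop of this abruptness yields about 20 m/s^2, the kind of aggressive event that, per Figure 3, is far more common in the CARLA splits than in nuScenes.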
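Figure 4 contrasts a plain MLP with a time-wise output head for regressing post-collision dynamics. The exact architectures are not described in this excerpt; the sketch below is one plausible reading, with assumed names and shapes: an MLP queried as a continuous function of time versus a separate output head per timestep, where only the latter can easily fit the instantaneous drop to zero shown in the figure.

```python
import torch
import torch.nn as nn

T, D = 16, 128  # assumed number of predicted timesteps and condition width

class TimeQueriedMLP(nn.Module):
    """One MLP evaluated at every (condition, time) pair. Because its output
    is a continuous function of t, an instantaneous post-collision stop is
    hard to fit and tends to come out as a gradual velocity decrease."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D + 1, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, cond):                          # cond: (B, D)
        t = torch.linspace(0, 1, T).view(1, T, 1).expand(cond.size(0), T, 1)
        c = cond.unsqueeze(1).expand(-1, T, -1)       # (B, T, D)
        return self.net(torch.cat([c, t], dim=-1)).squeeze(-1)  # (B, T) velocities

class TimeWiseHead(nn.Module):
    """A separate output head per timestep; nothing ties adjacent steps
    together, so a sharp drop to zero at the collision frame fits easily."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(D, 1) for _ in range(T)])

    def forward(self, cond):
        return torch.cat([h(cond) for h in self.heads], dim=-1)  # (B, T)

cond = torch.randn(2, D)
print(TimeQueriedMLP()(cond).shape, TimeWiseHead()(cond).shape)  # both (2, 16)
```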