Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space
Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Lang Zhang, Fu Liu, Peng Jia, Xianpeng Lang
TL;DR
This work tackles the limitation of ego-centric driving world models by introducing EOT-WM, a framework that unifies ego and other vehicle trajectories in video space. It converts BEV trajectories into per-vehicle trajectory videos, encodes them alongside driving videos with a Spatial-Temporal Variational Autoencoder, and uses a Trajectory-injected Diffusion Transformer to generate future frames conditioned on initial frames and trajectory guidance. Key contributions include the video-space trajectory representation, aligned latent spaces for motion guidance, and a trajectory controllability metric, all validated on nuScenes where EOT-WM surpasses state-of-the-art by approximately 30% in FID and 55% in FVD and can generate novel scenes with self-produced trajectories. The results demonstrate improved realism and controllability in driving scenario simulation, with potential benefits for evaluation, planning, and safety testing of autonomous driving systems.
Abstract
Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan the ego vehicle's trajectory. World models that can foresee the outcome of a trajectory have been used to evaluate such systems. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, which unifies Ego-Other vehicle Trajectories in videos for driving simulation. Matching multiple trajectories in the BEV space with each vehicle in the video, in order to control video generation, remains a challenge. We first project ego-other vehicle trajectories from the BEV space into image coordinates, so that vehicles and trajectories are matched via pixel positions. Then, trajectory videos are encoded by the Spatial-Temporal Variational Autoencoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation under the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
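The projection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes trajectory points have already been transformed from the BEV/ego frame into the camera frame (x right, y down, z forward), uses a plain pinhole model, and marks a single pixel per frame. The function names `project_point` and `trajectory_video`, and the intrinsics `fx`, `fy`, `cx`, `cy`, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_point(p_cam: Point3D, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point (assumed: x right, y down, z forward)."""
    x, y, z = p_cam
    if z <= 0.0:
        # The point lies behind the camera and cannot appear in the image.
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (int(round(u)), int(round(v)))


def trajectory_video(trajectory: List[Point3D], fx: float, fy: float,
                     cx: float, cy: float,
                     width: int, height: int) -> List[List[List[int]]]:
    """Build one binary frame per timestep, marking the vehicle's projected pixel.

    Assumption: a single marked pixel stands in for whatever trajectory
    rendering the actual method uses.
    """
    frames = []
    for point in trajectory:
        frame = [[0] * width for _ in range(height)]
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is not None:
            u, v = pixel
            if 0 <= u < width and 0 <= v < height:
                frame[v][u] = 1  # row index is v, column index is u
        frames.append(frame)
    return frames


# Hypothetical trajectory of one vehicle: two visible positions, then one behind the camera.
example_trajectory = [(1.0, 1.0, 10.0), (1.0, 1.0, 20.0), (0.0, 0.0, -5.0)]
example_frames = trajectory_video(example_trajectory, fx=100.0, fy=100.0,
                                  cx=32.0, cy=24.0, width=64, height=48)
```

Because each vehicle gets its own sequence of frames on the same pixel grid as the driving video, the trajectory and the vehicle it belongs to share pixel positions, which is the matching idea the abstract describes. In a real pipeline, the camera extrinsics and intrinsics would come from the dataset calibration rather than the made-up values used here.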
