
Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Lang Zhang, Fu Liu, Peng Jia, Xianpeng Lang

TL;DR

This work tackles the limitation of ego-centric driving world models by introducing EOT-WM, a framework that unifies ego and other vehicle trajectories in video space. It converts BEV trajectories into per-vehicle trajectory videos, encodes them alongside driving videos with a Spatial-Temporal Variational Autoencoder, and uses a Trajectory-injected Diffusion Transformer to generate future frames conditioned on initial frames and trajectory guidance. Key contributions include the video-space trajectory representation, aligned latent spaces for motion guidance, and a trajectory controllability metric, all validated on nuScenes where EOT-WM surpasses state-of-the-art by approximately 30% in FID and 55% in FVD and can generate novel scenes with self-produced trajectories. The results demonstrate improved realism and controllability in driving scenario simulation, with potential benefits for evaluation, planning, and safety testing of autonomous driving systems.

Abstract

Advanced end-to-end autonomous driving systems predict other vehicles' motions and plan the ego vehicle's trajectory. World models that can foresee the outcome of a trajectory have been used to evaluate such systems. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, which unifies Ego-Other vehicle Trajectories in videos for driving simulation. The main challenge is matching multiple trajectories in the BEV space with the corresponding vehicles in the video in order to control video generation. We first project ego and other vehicle trajectories from the BEV space into image coordinates, so that each trajectory is matched to its vehicle via pixel positions. The resulting trajectory videos are then encoded by the Spatial-Temporal Variational Autoencoder to align with driving video latents spatially and temporally in a unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation under the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments on the nuScenes dataset show that the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
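To make the projection step concrete, the following is a minimal sketch of how BEV waypoints might be mapped to pixel positions with a pinhole camera model. It is not the authors' implementation. The function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera height, the 1600x900 image size, and the assumption of a forward-facing camera with no pitch or roll are all hypothetical choices made for illustration. The ego frame is assumed to have x pointing forward and y pointing left, with waypoints lying on the ground plane.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]


def project_bev_point(
    x_fwd: float,
    y_left: float,
    fx: float = 1000.0,       # hypothetical focal length in pixels (horizontal)
    fy: float = 1000.0,       # hypothetical focal length in pixels (vertical)
    cx: float = 800.0,        # hypothetical principal point (horizontal)
    cy: float = 450.0,        # hypothetical principal point (vertical)
    cam_height: float = 1.5,  # hypothetical camera height above ground, metres
) -> Optional[Pixel]:
    """Project one ground-plane BEV waypoint into image coordinates.

    Assumes a forward-facing pinhole camera with no pitch or roll, so the
    camera axes are: x right, y down, z forward.
    """
    cam_x = -y_left      # ego "left" is camera "negative right"
    cam_y = cam_height   # the ground lies cam_height below the camera
    cam_z = x_fwd        # ego "forward" is camera depth
    if cam_z <= 0.1:     # behind or too close to the camera: not visible
        return None
    u = fx * cam_x / cam_z + cx
    v = fy * cam_y / cam_z + cy
    return (u, v)


def project_trajectory(
    waypoints: Sequence[Tuple[float, float]],
    width: int = 1600,
    height: int = 900,
) -> List[Optional[Pixel]]:
    """Project a BEV trajectory; waypoints outside the image become None."""
    pixels: List[Optional[Pixel]] = []
    for x_fwd, y_left in waypoints:
        uv = project_bev_point(x_fwd, y_left)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            pixels.append(uv)
        else:
            pixels.append(None)
    return pixels


if __name__ == "__main__":
    # A vehicle 10 m ahead, then 20 m ahead and 4 m to the right,
    # then a point far to the left that falls outside the image.
    print(project_trajectory([(10.0, 0.0), (20.0, -4.0), (1.0, 5.0)]))
```

Once every waypoint has a pixel position, each vehicle's trajectory can be drawn frame by frame to form a trajectory video with the same spatial and temporal layout as the driving video, which is what allows both to be encoded into aligned latents. A real pipeline would use the calibrated camera extrinsics and intrinsics from the dataset rather than the fixed values assumed here.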

Paper Structure

This paper contains 17 sections, 11 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The proposed EOT-WM is capable of generating more realistic videos with controllable ego and other vehicle trajectories. EOT-WM represents these trajectories in video space, whereas previous works such as Vista represent them in BEV space. E. V. Traj. and O. V. Traj. denote ego and other vehicle trajectories, respectively. A novel trajectory is a self-produced trajectory not included in the dataset.
  • Figure 2: Illustration of the proposed EOT-WM.
  • Figure 3: Illustration of the original video, other vehicle trajectory (O. V. Traj.), and ego vehicle trajectory (E. V. Traj.) used for the proposed EOT-WM. For brevity, we visualize only the 1st, 7th, 13th, 19th, and 25th frames.
  • Figure 4: Representative cases of action controllability achieved by the proposed EOT-WM and Vista on the validation set of the nuScenes dataset, where EOT-WM w/o O.V. Traj. denotes the proposed model trained without other vehicle trajectories.
  • Figure 5: Examples of the different text types used for the proposed EOT-WM, where the scene description and action description are provided in OmniDrive wang2024omnidrive.
  • ...and 2 more figures