DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment

Xiaofan Li, Chenming Wu, Zhao Yang, Zhihao Xu, Dingkang Liang, Yumeng Zhang, Ji Wan, Jun Wang

TL;DR

DriVerse addresses the challenge of generating long-horizon driving videos that faithfully follow a given trajectory from a single image. It introduces multimodal trajectory prompting (MTP), which encodes 3D trajectories into language tokens and trajectory-guided spatial anchors, along with latent motion alignment (LMA) to enforce inter-frame consistency and a dynamic window generation (DWG) strategy to maintain coherence during sharp heading changes. The approach yields state-of-the-art results on nuScenes and Waymo Open Dataset with limited training data, demonstrated through both perceptual metrics and a geometric trajectory-alignment evaluation. This work provides a robust, trajectory-conditioned simulation framework with practical implications for evaluating and training autonomous driving systems.

Abstract

This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. Previous autonomous driving world models either directly feed the trajectory or discrete control signals into the generation pipeline, leading to poor alignment between the control inputs and the implicit features of the 2D base generative model, which results in low-fidelity video outputs. Some methods use coarse textual commands or discrete vehicle control signals, which lack the precision to guide fine-grained, trajectory-specific video generation, making them unsuitable for evaluating actual autonomous driving algorithms. DriVerse introduces explicit trajectory guidance in two complementary forms: it tokenizes trajectories into textual prompts using a predefined trend vocabulary for seamless language integration, and converts 3D trajectories into 2D spatial motion priors to enhance control over static content within the driving scene. To better handle dynamic objects, we further introduce a lightweight motion alignment module, which focuses on the inter-frame consistency of dynamic pixels, significantly enhancing the temporal coherence of moving elements over long sequences. With minimal training and no need for additional data, DriVerse outperforms specialized models on future video generation tasks across both the nuScenes and Waymo datasets. The code and models will be released to the public.
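The abstract describes tokenizing trajectories into textual prompts with a predefined trend vocabulary, but this page does not list that vocabulary. The following is only a minimal sketch of the idea under stated assumptions: waypoints are `(x, y)` positions in a frame with x forward and y to the left, and the bin thresholds, the word list in `TURN_BINS`, and the function names are all hypothetical, not taken from DriVerse.

```python
import math
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float]  # (x forward, y left), in meters; an assumed convention

# Hypothetical trend vocabulary: upper bound (degrees) on the absolute
# heading change between consecutive segments, and the word it maps to.
TURN_BINS = [
    (2.0, "straight"),
    (10.0, "slight"),
    (30.0, "moderate"),
    (float("inf"), "sharp"),
]


def heading_deg(p: Waypoint, q: Waypoint) -> float:
    """Heading of the segment p -> q in degrees, counter-clockwise from +x."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))


def turn_word(delta_deg: float) -> str:
    """Map a signed heading change to a trend word (positive = left)."""
    magnitude = abs(delta_deg)
    for upper, name in TURN_BINS:
        if magnitude < upper:
            break
    if name == "straight":
        return name
    side = "left" if delta_deg > 0 else "right"
    return f"{name} {side}"


def trajectory_to_prompt(waypoints: Sequence[Waypoint]) -> str:
    """Turn a waypoint sequence into a short textual trend description."""
    headings = [heading_deg(p, q) for p, q in zip(waypoints, waypoints[1:])]
    words: List[str] = []
    for h0, h1 in zip(headings, headings[1:]):
        # Wrap the heading difference into [-180, 180) degrees.
        delta = (h1 - h0 + 180.0) % 360.0 - 180.0
        word = turn_word(delta)
        # Merge consecutive identical words into one trend token.
        if not words or words[-1] != word:
            words.append(word)
    return ", then ".join(words) if words else "straight"
```

For example, `trajectory_to_prompt([(0, 0), (1, 0), (2, 0), (3, 0.5), (4, 1.5)])` returns `"straight, then moderate left"`. Discretizing into a small vocabulary keeps the prompt in a form a text encoder can handle, at the cost of precision, which is presumably why the abstract pairs the textual prompt with a second, spatial form of guidance.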

Paper Structure

This paper contains 13 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: DriVerse, our proposed navigation world model for driving simulation, transforms a single input image and any of several navigation trajectories into a high-quality video that follows the intended motion. Keeping the generated videos closely aligned with real-world driving scenarios improves the realism and utility of driving simulation.
  • Figure 2: Overview of the DriVerse framework. Given a single scene image and a future trajectory, DriVerse decomposes the generation task into static alignment and motion alignment. The former, referred to as Multimodal Trajectory Prompting (MTP), encodes future trajectories into textual prompts using a predefined trend vocabulary, and further injects spatial motion priors derived from 3D static anchors into the frozen backbone via a trainable control module. The latter, called Latent Motion Alignment (LMA), supervises the generation by enforcing consistency between generated and ground-truth dynamic pixels, based on offline-computed motion correspondences.
  • Figure 3: The top and bottom parts of the figure visualize, through pixel-level tracking, the respective influences of object motion and camera motion on image pixels. The red-circled regions highlight cases where a large change in the ego vehicle's heading angle leads to a reduction in the number of initialized static points.
  • Figure 4: Qualitative comparison with existing methods. Top: Visualization adapted from the original Vista paper. We input only the first frame to DriVerse, which is capable of generating high-quality, long-horizon future predictions. Bottom: Comparison between DriVerse and Vista. White dashed circles indicate regions of implausible generation.
  • Figure 5: Qualitative comparison of inference results on the Waymo Open Dataset. The top two rows show the diverse future predictions (Future A/B) generated by the full DriVerse model. The third row presents the results of the model without the LMA module, where red dashed circles highlight regions with inconsistent or implausible motion. The fourth row shows the generation results of a base image-to-video (I2V) model applied directly to street scenes, where white dashed circles and arrows indicate noticeable visual artifacts.
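The captions of Figures 2 and 3 mention spatial motion priors derived from 3D static anchors, and note that a large change in the ego heading reduces the number of initialized static points. The sketch below illustrates that geometric effect with a plain pinhole projection. It is an assumption-laden illustration, not the DriVerse implementation: the intrinsics, image size, camera height, coordinate conventions, and the names `project_anchor` and `count_visible` are all hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]  # world frame: x forward, y left, z up (assumed)
Pose2D = Tuple[float, float, float]   # ego pose: x, y, yaw in radians (assumed)

# Hypothetical camera parameters, chosen only for illustration.
FX, FY = 1000.0, 1000.0   # focal lengths in pixels
CX, CY = 800.0, 450.0     # principal point
WIDTH, HEIGHT = 1600, 900
CAM_HEIGHT = 1.5          # camera height above ground, in meters
MIN_DEPTH = 0.1           # points closer than this are treated as not visible


def project_anchor(anchor: Point3D, pose: Pose2D) -> Optional[Tuple[float, float]]:
    """Project a static world anchor into the image at a given ego pose.

    Returns pixel coordinates (u, v), or None if the anchor is behind the
    camera or falls outside the image.
    """
    ex, ey, yaw = pose
    dx, dy = anchor[0] - ex, anchor[1] - ey
    # Rotate the world offset into the ego frame (x forward, y left).
    x_ego = math.cos(yaw) * dx + math.sin(yaw) * dy
    y_ego = -math.sin(yaw) * dx + math.cos(yaw) * dy
    z_ego = anchor[2] - CAM_HEIGHT
    # Camera frame: depth along ego x, image x to the right, image y down.
    depth, x_cam, y_cam = x_ego, -y_ego, -z_ego
    if depth < MIN_DEPTH:
        return None
    u = FX * x_cam / depth + CX
    v = FY * y_cam / depth + CY
    if 0.0 <= u < WIDTH and 0.0 <= v < HEIGHT:
        return (u, v)
    return None


def count_visible(anchors: Sequence[Point3D], pose: Pose2D) -> int:
    """Number of anchors that project inside the image at this pose."""
    return sum(project_anchor(a, pose) is not None for a in anchors)
```

With three anchors 10 m ahead of the start pose, `count_visible` returns 3 at zero yaw and 0 after a 60-degree left turn under these assumed intrinsics. Anchors initialized in the first frame leave the field of view during sharp turns, which is consistent with the motivation the TL;DR gives for the dynamic window generation strategy.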