Generating Multimodal Driving Scenes via Next-Scene Prediction

Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang

TL;DR

The paper addresses the need for diverse, long-horizon driving scene generation across multiple modalities. It introduces UMGen, a four-modal framework that combines Ego-action, Map, Agents, and Images, using a two-stage autoregressive approach: TAR models inter-frame dynamics while OAR ensures intra-frame modality coherence; AMA aligns map features with ego-action to maintain cross-modal consistency. Key contributions include integrating four modalities, a computationally efficient TAR/OAR architecture, and the AMA module for action-aware map alignment, demonstrated on up to 60-second sequences with user-guided control and ablations confirming component effectiveness. The work advances closed-loop autonomous driving simulations by enabling interactive, multimodal scene generation with improved realism and controllability, with diffusion-based refinement discussed as a potential enhancement for future work.

Abstract

Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by capturing only a limited range of modalities, which restricts their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which transforms the map features according to the ego-action. Our framework generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/
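
To make the idea of an ego-action-based map transformation concrete, here is a minimal geometric sketch in Python. It is not the AMA module itself: the abstract does not specify the transformation, and the paper's module acts on map features rather than raw points. The function name `align_map_to_ego`, the planar action parameterization `(dx, dy, dyaw)`, and the x-forward, y-left axis convention are all assumptions made for illustration.

```python
import math


def align_map_to_ego(points, dx, dy, dyaw):
    """Re-express 2D map points from the previous ego frame in the new ego frame.

    Assumptions (illustrative, not from the paper): the ego-action is a planar
    rigid motion (dx, dy, dyaw) measured in the previous ego frame, with x
    pointing forward, y pointing left, and dyaw in radians (counter-clockwise).
    """
    cos_y, sin_y = math.cos(dyaw), math.sin(dyaw)
    aligned = []
    for x, y in points:
        # Move the origin to the new ego position.
        tx, ty = x - dx, y - dy
        # Rotate by -dyaw so the axes follow the new ego heading.
        aligned.append((cos_y * tx + sin_y * ty, -sin_y * tx + cos_y * ty))
    return aligned
```

Under these assumptions, a map point 10 m ahead ends up 8 m ahead after the ego vehicle advances 2 m, and a point 5 m to the left ends up 5 m ahead after a 90-degree left turn in place. Keeping the map expressed in the current ego frame in this way is one simple reading of what coherence between the map and ego-action means.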

Paper Structure

This paper contains 12 sections, 13 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: An overview of UMGen, our proposed driving scene generation paradigm. (a) Starting from a random initialization, UMGen generates ego-centric, multimodal scenes frame by frame. Each scene encompasses four modalities: ego-vehicle action, map, traffic agent, and image. (b) UMGen offers multiple functions. It can autonomously generate multimodal scene sequences based solely on its own historical context, and it can also predict the other modalities from ego-vehicle actions provided by users. Furthermore, UMGen can incorporate user-specified agent actions to create customized scene sequences. The three scene sequences, from top to bottom, show the ego vehicle autonomously driving straight through an intersection, executing a user-defined right turn, and encountering a user-specified white car that cuts in front of it. For better visualization, a portion of the map corresponding to the user-specified scenario is zoomed in.
  • Figure 2: Pipeline of UMGen. Given $T$ past frames of multimodal driving scenes, each containing ego-action, map, traffic agents, and image, every modality is tokenized into discrete tokens. The token embeddings are then processed by the Ego-action Prediction module, which forecasts the ego-action for time step $T+1$. Using this predicted ego-action, the AMA module adjusts the map features. Next, the TAR module aggregates temporal information across the sequence, while the OAR module predicts the modalities within each frame sequentially, autoregressively generating each token conditioned on the aggregated historical information. Finally, the predicted tokens are fed to the decoder to obtain the next scene.
  • Figure 3: Generated multimodal driving scenes by UMGen: The generated scenes evolve continuously from the ego vehicle's perspective. Red Box: ego-vehicle, Green Box: cars, Orange Box: pedestrians or cyclists, Arrow: agent velocities.
  • Figure 4: Generated scenes with input ego actions. The first row shows the interactive control of the ego vehicle to perform left and right turns. The second row shows the ego vehicle, initialized with a right-turn velocity or a gradual deceleration to a stop. Red box: ego-vehicle, green box: vehicle, orange box: pedestrians or cyclists, arrow: agent velocity.
  • Figure 5: Customized scenario generation by UMGen: The first row presents the original scene from the dataset. We assign a forward-left velocity to the vehicle highlighted by the yellow dashed box, and the ego-vehicle spontaneously brakes (second row). Alternatively, we can actively control the ego-vehicle to perform a lane change (third row). Red box: ego-vehicle, green box: vehicle, arrow: agent velocity.
  • ...and 5 more figures
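
The next-scene prediction loop described in the Figure 2 caption can be sketched as a toy Python program. Everything here is a stand-in: `temporal_aggregate` replaces the learned TAR module with a per-modality sum, `predict_token` replaces the learned OAR predictor with a deterministic arithmetic rule, and the modality order in `MODALITY_ORDER` is an assumption (the source only says the order is fixed). The AMA step and the tokenizer and decoder are omitted. The sketch only shows the control flow: aggregate each modality over time first, then generate the new frame's tokens one by one in a fixed modality order, each conditioned on the tokens already produced for that frame.

```python
# Assumed fixed modality order; the source states only that the order is fixed.
MODALITY_ORDER = ("ego_action", "map", "agents", "image")


def temporal_aggregate(history):
    """Stand-in for TAR: summarize each modality separately across past frames."""
    return {m: sum(sum(frame[m]) for frame in history) for m in MODALITY_ORDER}


def predict_token(context, prefix, vocab_size=16):
    """Stand-in for a learned predictor: a deterministic rule, not a model."""
    return (context + 31 * sum(prefix) + len(prefix)) % vocab_size


def next_scene(history, tokens_per_modality=2):
    """Stand-in for OAR: emit tokens modality by modality in a fixed order.

    Each token is conditioned on the temporal summary of its own modality and
    on every token already generated for the current frame.
    """
    context = temporal_aggregate(history)
    prefix = []  # tokens generated so far within the current frame
    scene = {}
    for modality in MODALITY_ORDER:
        tokens = []
        for _ in range(tokens_per_modality):
            token = predict_token(context[modality], prefix)
            tokens.append(token)
            prefix.append(token)
        scene[modality] = tokens
    return scene


def rollout(history, steps):
    """Autoregressive scene generation: append each predicted scene to the history."""
    history = list(history)
    for _ in range(steps):
        history.append(next_scene(history))
    return history
```

For example, calling `rollout` with a single seed frame such as `{"ego_action": [1, 2], "map": [3, 4], "agents": [5, 6], "image": [7, 8]}` and `steps=3` returns a four-frame sequence. Splitting the work this way keeps the temporal pass per modality and limits token-by-token generation to a single frame, which is one plausible reading of how a two-stage design manages computational demands.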