Generating Multimodal Driving Scenes via Next-Scene Prediction
Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang
TL;DR
The paper addresses the need for diverse, long-horizon driving scene generation across multiple modalities. It introduces UMGen, a framework that jointly generates four modalities (ego-action, map, agents, and images) using a two-stage autoregressive approach: a Temporal AutoRegressive (TAR) component models inter-frame dynamics, while an Ordered AutoRegressive (OAR) component keeps the modalities within each frame coherent. An Action-aware Map Alignment (AMA) module transforms map features according to the ego-action to maintain consistency between the two. Key contributions include the integration of four modalities, the computationally efficient TAR/OAR architecture, and the AMA module. The framework is demonstrated on sequences of up to 60 seconds with user-guided control, and ablations confirm the effectiveness of each component. The work advances closed-loop autonomous driving simulation by enabling interactive, multimodal scene generation with improved realism and controllability; diffusion-based refinement is discussed as a potential enhancement for future work.
Abstract
Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods capture only a limited range of modalities, which restricts their ability to generate controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including the novel addition of a map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality, while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between the map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation to the map based on the ego-action. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements. Project page: https://yanhaowu.github.io/UMGen/
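To make the idea behind action-aware map alignment concrete, the following is a minimal sketch of one way a map could be transformed according to the ego-action. It is an illustration under stated assumptions, not the AMA module itself: it assumes the ego-action is a planar motion (dx, dy, dyaw) expressed in the previous ego frame, and it operates on raw 2D map points rather than on learned map features or tokens. The function name `align_map_to_ego` and its parameters are hypothetical.

```python
import math


def align_map_to_ego(points, dx, dy, dyaw):
    """Re-express 2D map points in the ego frame after one ego motion step.

    Assumptions (not taken from the paper): points are (x, y) pairs in the
    previous ego frame with x forward and y to the left; the ego then
    translates by (dx, dy) and rotates by dyaw radians (counter-clockwise),
    both measured in that previous frame.
    """
    cos_t, sin_t = math.cos(dyaw), math.sin(dyaw)
    aligned = []
    for x, y in points:
        # Shift so that the new ego position becomes the origin.
        tx, ty = x - dx, y - dy
        # Rotate by -dyaw so that the new ego heading becomes the x axis.
        aligned.append((cos_t * tx + sin_t * ty, -sin_t * tx + cos_t * ty))
    return aligned


# Hypothetical usage: a lane point 5 m ahead, after the ego drives 2 m forward.
moved = align_map_to_ego([(5.0, 0.0)], dx=2.0, dy=0.0, dyaw=0.0)
```

In this sketch a static map element stays fixed in the world while its coordinates change with each ego step, which is the kind of map and ego-action coherence that the abstract describes.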
