DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

Jianbiao Mei, Tao Hu, Xuemeng Yang, Licheng Wen, Yu Yang, Tiantian Wei, Yukai Ma, Min Dou, Botian Shi, Yong Liu

TL;DR

DreamForge tackles the gap in realistic, controllable, long-term driving-scene video generation. It introduces perspective guidance and object-wise position encoding to improve street and foreground fidelity, and motion-aware temporal attention to preserve coherence across frames. An autoregressive diffusion pipeline enables generation of long videos from models trained on short sequences. The method integrates with the DriveArena simulator to support robust open-loop and closed-loop evaluations of vision-based driving agents.

Abstract

Recent advances in diffusion models have improved controllable streetscape generation and supported downstream perception and planning tasks. However, challenges remain in accurately modeling driving scenes and generating long videos. To alleviate these issues, we propose DreamForge, an advanced diffusion-based autoregressive video generation model tailored for 3D-controllable long-term generation. To enhance lane and foreground generation, we introduce perspective guidance and integrate object-wise position encoding to incorporate local 3D correlation and improve foreground object modeling. We also propose motion-aware temporal attention to capture motion cues and appearance changes in videos. By leveraging motion frames and an autoregressive generation paradigm, we can generate long videos (over 200 frames) using a model trained on short sequences, achieving superior quality compared to the baseline in 16-frame video evaluations. Finally, we integrate our method with the realistic simulator DriveArena to provide more reliable open-loop and closed-loop evaluations for vision-based driving agents. Project Page: https://pjlab-adg.github.io/DriveArena/dreamforge.
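The autoregressive idea in the abstract, where a short-clip model is chained to produce a long video, can be illustrated with a minimal sketch. This is not the authors' implementation: `rollout`, `toy_generator`, the clip length of 16, and the use of 2 motion frames are all hypothetical choices made for illustration, and integers stand in for image frames.

```python
from typing import Callable, List

Frame = int  # hypothetical stand-in for an image tensor


def rollout(
    generate_clip: Callable[[List[Frame], int], List[Frame]],
    total_frames: int,
    clip_len: int = 16,    # assumed clip length, not taken from the paper
    num_motion: int = 2,   # assumed number of motion frames
) -> List[Frame]:
    """Chain short clips into one long video.

    Every call after the first is conditioned on the last `num_motion`
    frames generated so far (the "motion frames").
    """
    if not 0 < num_motion < clip_len:
        raise ValueError("num_motion must be in (0, clip_len)")
    video: List[Frame] = []
    motion: List[Frame] = []  # empty for the very first clip
    while len(video) < total_frames:
        clip = generate_clip(motion, clip_len)
        video.extend(clip)
        motion = video[-num_motion:]  # condition the next clip on recent frames
    return video[:total_frames]


def toy_generator(motion: List[Frame], clip_len: int) -> List[Frame]:
    """Toy model: continue counting from the last motion frame."""
    start = motion[-1] + 1 if motion else 0
    return list(range(start, start + clip_len))


long_video = rollout(toy_generator, total_frames=200)
print(len(long_video))
```

In a real pipeline, `generate_clip` would run the diffusion denoising loop with the other conditions (layouts, boxes, ego poses); the sketch only shows how conditioning on recent frames lets a fixed-length generator extend past its training length.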

Paper Structure

This paper contains 31 sections, 7 equations, 15 figures, and 11 tables.

Figures (15)

  • Figure 1: (a) Overall framework. During the denoising process, DreamForge leverages various conditions to enhance the modeling of driving scenes. Additionally, we introduce perspective guidance and incorporate object-wise position encoding (OPE) to improve street and foreground generation. We also implement motion-aware attention (MTA) to enhance temporal coherence, supporting long-term video generation through autoregression. "P" denotes the perspective projection. (b) The overall procedure of OPE. We only encode frustum sampling points in the 3D bounding boxes into the object position embedding. (c) The detailed architecture of MTA, which learns motion cues from motion frames, ego poses, and bidirectional feature differences.
  • Figure 2: Visual Comparison. Our DreamForge produces more geometrically accurate images due to the perspective guidance.
  • Figure 3: The closed-loop simulation platform DriveArena [yang2024drivearena] utilizes LimSim [wenl2023limsim] to parse HD maps, manage traffic flow, detect collisions, and generate road layouts, vehicle boxes, and ego poses for driving scene generation. We upgrade the World Dreamer with our DreamForge for better temporal coherence.
  • Figure 4: Validation results of map-view segmentation for vehicles (a) and road (b) during the training procedure of CVT [zhou2022cross].
  • Figure 5: Visual comparison of foreground generation. The illustrations demonstrate that our DreamForge achieves better foreground object generation. Please see the Appendix for more cases.
  • ...and 10 more figures
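
The Figure 1(b) caption states that only frustum sampling points inside the 3D bounding boxes are encoded into the object position embedding. The selection step alone can be sketched as follows. This is an illustrative simplification, not the paper's code: the function names are hypothetical, a pinhole camera is assumed, and boxes are treated as axis-aligned in the camera frame, whereas real driving-scene boxes are oriented.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner); axis-aligned is an assumption


def frustum_points(
    width: int, height: int, depths: Sequence[float],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point]:
    """Back-project every pixel at every depth bin with a pinhole model."""
    points: List[Point] = []
    for v in range(height):
        for u in range(width):
            for d in depths:
                points.append(((u - cx) / fx * d, (v - cy) / fy * d, d))
    return points


def inside_any_box(p: Point, boxes: Sequence[Box]) -> bool:
    """True if point p lies inside at least one box."""
    return any(
        all(lo[i] <= p[i] <= hi[i] for i in range(3)) for lo, hi in boxes
    )


def object_point_mask(points: Sequence[Point], boxes: Sequence[Box]) -> List[bool]:
    """Mask selecting the frustum points that fall inside object boxes."""
    return [inside_any_box(p, boxes) for p in points]


# Tiny example: a 3x1 image with two depth bins and one box around the optical axis.
pts = frustum_points(3, 1, [1.0, 2.0], fx=1.0, fy=1.0, cx=1.0, cy=0.0)
mask = object_point_mask(pts, [((-0.5, -0.5, 0.5), (0.5, 0.5, 1.5))])
print(sum(mask), "of", len(pts), "points selected")
```

Only the selected points would then be passed to whatever encoder produces the position embedding; that encoder is outside the scope of this sketch.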