MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control

Yining Yao, Xi Guo, Chenjing Ding, Wei Wu

TL;DR

MyGo tackles the challenge of generating high-quality, camera-controllable multi-view driving videos by injecting onboard camera motion into a pre-trained video diffusion model via a ControlNet-like module. It represents camera parameters with Plücker embeddings and enforces cross-view coherence through epipolar-geometry-guided neighbor-view attention, while preserving the pre-trained model's capabilities. The approach achieves state-of-the-art results on nuScenes for multi-view driving video generation and superior camera controllability on RealEstate10K, with ablations validating the contributions of camera injection and epipolar constraints. This work advances autonomous-driving simulation by enabling precise ego-vehicle motion control and consistent multi-view synthesis, which supports more accurate environment modeling and training data generation.

Abstract

High-quality driving video generation is crucial for providing training data for autonomous driving models. However, current generative models rarely focus on enhancing camera motion control under multi-view tasks, which is essential for driving video generation. Therefore, we propose MyGo, an end-to-end framework for video generation, introducing motion of onboard cameras as conditions to make progress in camera controllability and multi-view consistency. MyGo employs additional plug-in modules to inject camera parameters into the pre-trained video diffusion model, which retains the extensive knowledge of the pre-trained model as much as possible. Furthermore, we use epipolar constraints and neighbor view information during the generation process of each view to enhance spatial-temporal consistency. Experimental results show that MyGo has achieved state-of-the-art results in both general camera-controlled video generation and multi-view driving video generation tasks, which lays the foundation for more accurate environment simulation in autonomous driving. Project page: https://metadrivescape.github.io/papers_project/MyGo/page.html
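
The TL;DR mentions that camera parameters are represented with Plücker embeddings. As a rough illustration of what such an embedding usually looks like, the sketch below computes a per-pixel 6-channel map (moment `o × d` followed by unit ray direction `d`) from a pinhole intrinsic matrix and a camera-to-world pose. The function name `plucker_embedding`, the pinhole model, the pixel-center convention, and the channel order are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker coordinates (o x d, d) for one camera.

    K:   (3, 3) pinhole intrinsics (assumed camera model)
    c2w: (4, 4) camera-to-world extrinsics
    Returns an array of shape (height, width, 6).
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame ray directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment of each ray: camera center crossed with the ray direction.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)

    return np.concatenate([moment, dirs_world], axis=-1)


# Usage with made-up intrinsics and a translated camera.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]
emb = plucker_embedding(K, c2w, height=48, width=64)
```

Because every pixel gets its own ray, the resulting map has the same spatial layout as the video frame, which is what makes it convenient to feed into a ControlNet-like branch alongside image features.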

Paper Structure

This paper contains 21 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of the generated multi-view video frames. MyGo can generate multi-view videos precisely controlled by onboard camera parameters and road structural information while maintaining excellent temporal consistency and long-term cross-view spatial consistency.
  • Figure 2: MyGo takes a BEV map, 3D bounding boxes, neighbor views, keyframes, and camera parameters as conditions, and uses a unified encoder to process them. The encoded conditions are then integrated into the U-Net by a condition cross-attention block. We design a ControlNet-like structure to inject camera Plücker coordinates into the pre-trained U-Net blocks. Moreover, in the neighbor-view cross-attention block, we use epipolar geometry as a constraint to guide the cross-attention computation.
  • Figure 3: Visualization of the attention mask based on epipolar geometry. During neighbor-view cross-attention, a mask is computed so that the right side of the left neighbor and the left side of the right neighbor are included in the calculation, while other parts are ignored.
  • Figure 4: Result of a lane-change case, demonstrating that our method can edit the ego vehicle's motion while maintaining generation quality and spatial-temporal consistency.
  • Figure 5: Experiments on RealEstate10K, including generation results for several camera trajectories and a comparison with baseline methods. Content in red boxes shows how our method outperforms the baselines in camera controllability.
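
The captions for Figures 2 and 3 describe an attention mask derived from epipolar geometry. The sketch below shows one generic, pixel-level way to build such a mask: for each query pixel in the current view, it computes the epipolar line in the neighbor view from the fundamental matrix and keeps only the neighbor pixels that lie within a distance threshold of that line. This is an illustration under stated assumptions, not the paper's implementation; the function names, the pose convention `x_n = R x_q + t`, the pixel-center convention, and the `threshold` parameter are all hypothetical, and the mask actually used in MyGo may be coarser, as the Figure 3 caption suggests.

```python
import numpy as np


def skew(t):
    """Skew-symmetric matrix so that skew(t) @ x equals np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def epipolar_attention_mask(K_q, K_n, R, t, height, width, threshold=2.0):
    """Boolean mask of shape (H*W, H*W): rows are query pixels, columns are neighbor pixels.

    R, t map query-camera coordinates to neighbor-camera coordinates
    (x_n = R @ x_q + t); this convention is an assumption of the sketch.
    """
    # Fundamental matrix F with x_n^T F x_q = 0 for corresponding pixels.
    F = np.linalg.inv(K_n).T @ skew(t) @ R @ np.linalg.inv(K_q)

    # Homogeneous pixel centers, flattened row-major to shape (H*W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u.ravel(), v.ravel(), np.ones(height * width)], axis=-1)

    # Epipolar line in the neighbor view for every query pixel: l_i = F @ x_i.
    lines = pix @ F.T

    # Point-to-line distance from every neighbor pixel j to every line i.
    num = np.abs(lines @ pix.T)
    den = np.linalg.norm(lines[:, :2], axis=-1, keepdims=True)
    dist = num / np.maximum(den, 1e-8)

    return dist <= threshold


# Usage: a purely sideways camera offset yields horizontal epipolar lines,
# so each query pixel may attend only to neighbor pixels in its own row.
K = np.array([[50.0, 0.0, 3.0], [0.0, 50.0, 2.0], [0.0, 0.0, 1.0]])
mask = epipolar_attention_mask(K, K, np.eye(3), np.array([1.0, 0.0, 0.0]),
                               height=4, width=6, threshold=0.5)
```

In an attention layer, a mask like this would typically be applied by setting the masked-out logits to a large negative value before the softmax, so that each query aggregates features only from geometrically plausible locations in the neighbor view.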