MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control

Yining Yao; Xi Guo; Chenjing Ding; Wei Wu

TL;DR

MyGo generates high-quality, camera-controllable multi-view driving videos by injecting onboard camera motion into a pre-trained video diffusion model through a ControlNet-like module. It represents camera parameters with Plücker embeddings and enforces cross-view coherence through epipolar-geometry-guided neighbor-view attention, while preserving the pre-trained model's capabilities. The approach achieves state-of-the-art results on nuScenes for multi-view driving video generation and superior camera controllability on RealEstate10K, with ablations validating the contributions of camera injection and epipolar constraints. This work advances autonomous-driving simulation by enabling precise ego-vehicle motion control and consistent multi-view synthesis, which supports more accurate environment modeling and training data generation.
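To make the Plücker representation mentioned above concrete, here is a minimal sketch of a per-pixel Plücker embedding for a pinhole camera. It follows a common convention in which each pixel ray is encoded as the 6-vector (o × d, d), with o the camera center and d the unit ray direction in world coordinates. The function names, the camera-to-world convention for `R`, and the pure-Python layout are assumptions made for illustration and are not taken from the MyGo implementation.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_embedding(fx, fy, cx, cy, R, o, height, width):
    """Return a height x width grid of 6-D Plücker vectors (o x d, d).

    Assumptions of this sketch (hypothetical, not from the paper):
    R is a 3x3 camera-to-world rotation given as nested lists, and
    o is the camera center in world coordinates.
    """
    grid = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel through the pinhole intrinsics.
            d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
            # Rotate the ray into the world frame.
            d = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
            # Normalize so the direction part has unit length.
            norm = math.sqrt(sum(c * c for c in d))
            d = tuple(c / norm for c in d)
            # Moment of the ray about the world origin.
            m = cross(o, d)
            row.append(m + d)
        grid.append(row)
    return grid
```

Because the moment o × d is orthogonal to d by construction, every pixel's 6-vector satisfies m · d = 0, which is a quick sanity check on any implementation. In a real pipeline this grid would typically be built as a tensor at the latent resolution for every frame and view before being passed to the camera-injection module.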

Abstract

High-quality driving video generation is crucial for providing training data for autonomous driving models. However, current generative models rarely focus on camera motion control in multi-view settings, which is essential for driving video generation. We therefore propose MyGo, an end-to-end video generation framework that takes the motion of onboard cameras as a condition to improve camera controllability and multi-view consistency. MyGo employs additional plug-in modules to inject camera parameters into the pre-trained video diffusion model, retaining as much of the pre-trained model's knowledge as possible. Furthermore, we use epipolar constraints and neighbor-view information when generating each view to enhance spatial-temporal consistency. Experimental results show that MyGo achieves state-of-the-art results in both general camera-controlled video generation and multi-view driving video generation, laying the foundation for more accurate environment simulation in autonomous driving. Project page: https://metadrivescape.github.io/papers_project/MyGo/page.html

Paper Structure (21 sections, 8 equations, 5 figures, 3 tables)

Introduction
Related Works
Multi-view Video Generation
Camera Controlled Video Generation
Method
Preliminary
Overview
Representing Camera Condition
Integrate Camera Pose into Video Generator
Use Neighbor View Information with Camera Condition to Enhance Multi-view Consistency
Experiments
Experiment Details
Datasets
Evaluation Metrics
Implement Details
...and 6 more sections

Figures (5)

Figure 1: Examples of the generated multi-view video frames. MyGo is capable of generating multi-view videos precisely controlled by onboard camera parameters as well as road structural information while maintaining excellent temporal consistency as well as cross-view long-term spatial consistency.
Figure 2: MyGo takes a BEV map, 3D bounding boxes, the neighbor view, keyframes, and camera parameters as conditions, and uses a unified encoder to process them. The encoded conditions are then integrated into the U-Net by a condition cross-attention block. We design a ControlNet-like structure to inject camera Plücker coordinates into the pre-trained U-Net blocks. Moreover, in the neighbor-view cross-attention block, we use epipolar geometry as a constraint to guide the cross-attention computation.
Figure 3: Visualization of the attention mask based on epipolar geometry. During neighbor-view cross-attention, a mask is computed so that the right side of the left neighbor and the left side of the right neighbor are included in the calculation, while other parts are ignored.
Figure 4: Result of a lane-change case, demonstrating that our method can edit the ego vehicle's motion while maintaining generation quality and spatial-temporal consistency.
Figure 5: Experiments on RealEstate10K, including generation results for several camera trajectories and a comparison with baseline methods. Content in red boxes shows how our method outperforms the baselines in camera controllability.
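
The epipolar attention mask described for the neighbor-view cross-attention can be illustrated with a generic sketch. It assumes a known fundamental matrix `F` that maps a pixel in the current view to its epipolar line in the neighbor view, and it lets a query pixel attend only to neighbor pixels within a distance threshold of that line. The function name, the threshold, and the list-based layout are hypothetical. The exact mask construction used by MyGo is not specified in this summary, so this shows the general technique rather than the authors' implementation.

```python
import math


def epipolar_mask(F, query_pixels, key_pixels, threshold):
    """Boolean mask: mask[i][j] is True when key pixel j (neighbor view)
    lies within `threshold` pixels of the epipolar line of query pixel i.

    F is a 3x3 fundamental matrix (nested lists), assumed to map a
    homogeneous pixel in the current view to a line in the neighbor view.
    """
    mask = []
    for (u, v) in query_pixels:
        # Epipolar line l = F @ [u, v, 1], written as a*x + b*y + c = 0.
        a, b, c = [F[r][0] * u + F[r][1] * v + F[r][2] for r in range(3)]
        norm = math.hypot(a, b)
        row = []
        for (x, y) in key_pixels:
            if norm == 0.0:
                # Degenerate line: fall back to attending everywhere.
                row.append(True)
            else:
                # Point-to-line distance in pixels.
                dist = abs(a * x + b * y + c) / norm
                row.append(dist <= threshold)
        mask.append(row)
    return mask


# Usage: for a purely horizontal camera shift, epipolar lines are image
# rows, so a query at row 1 may only attend to keys on (or near) row 1.
F_horizontal = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
mask = epipolar_mask(F_horizontal, [(2, 1)], [(0, 1), (5, 1), (0, 3)], 0.5)
```

In an attention layer, such a mask would usually be converted to an additive bias (zero where True, a large negative value where False) before the softmax. For adjacent cameras on a vehicle, the unmasked region tends to concentrate in the overlapping field of view, which is consistent with the left and right border regions highlighted in Figure 3.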
