MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability

Buyu Liu, Kai Wang, Yansong Liu, Jun Bao, Tingting Han, Jun Yu

TL;DR

MVPbev tackles cross-view RGB generation from BEV semantics and text prompts with a two-stage pipeline: first, semantically consistent projection of BEV maps to multiple perspective views using camera geometry; second, cross-view image generation with a multi-view attention module and a training-time noise initialization/de-noising scheme within a latent diffusion framework. The method enforces both global semantic consistency and local visual coherence across overlapping fields of view, and enables test-time instance-level controllability. Evaluations on NuScenes demonstrate strong quantitative and qualitative performance, superior cross-view consistency, and notable generalizability to unseen viewpoints, supported by comprehensive human analyses. This approach advances controllable, data-efficient multi-view synthesis for autonomous driving and related applications.

Abstract

This work addresses multi-view perspective RGB generation from text prompts given Bird's-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, cannot handle detailed text prompts, or fail to generalize to unseen viewpoints, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test time. Specifically, MVPbev first projects the given BEV semantics to perspective views using camera parameters, which enables the model to generalize to unseen viewpoints. We then introduce a multi-view attention module, together with special initialization and de-noising processes, to explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Finally, MVPbev allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Extensive experiments on NuScenes demonstrate that our method can generate high-resolution photorealistic images from text descriptions with thousands of training samples, surpassing state-of-the-art methods under various evaluation metrics. We further demonstrate the advantages of our method in terms of generalizability and controllability with the help of novel evaluation metrics and comprehensive human analysis. Our code, data, and model can be found at https://github.com/kkaiwwana/MVPbev.
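
To make the first stage more concrete, the following is a minimal sketch of projecting BEV ground-plane points into one perspective view with a pinhole camera model. It is an illustration under stated assumptions, not the authors' implementation: the function name `project_ground_points`, the flat-ground assumption (z = 0 in the ego frame), and the convention that `R` and `t` map ego coordinates to camera coordinates are all hypothetical choices made for this example.

```python
import numpy as np

def project_ground_points(points_xy, K, R, t):
    """Project BEV ground-plane points into a pinhole camera (illustrative only).

    points_xy: (N, 2) x, y positions in the ego frame, assumed to lie on z = 0.
    K:         (3, 3) camera intrinsics.
    R, t:      (3, 3) rotation and (3,) translation mapping ego -> camera coords.
    Returns (N, 2) pixel coordinates and an (N,) bool mask of points that lie
    in front of the camera.
    """
    points_xy = np.asarray(points_xy, dtype=float)
    # Lift the 2D BEV points onto the assumed flat ground plane (z = 0).
    pts_ego = np.concatenate([points_xy, np.zeros((len(points_xy), 1))], axis=1)
    # Rigid transform into the camera frame (row-vector form of R @ p + t).
    pts_cam = pts_ego @ R.T + t
    depth = pts_cam[:, 2]
    in_front = depth > 1e-6
    # Apply intrinsics, then divide by depth (perspective division).
    uvw = pts_cam @ K.T
    safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
    uv = uvw[:, :2] / safe_depth[:, None]
    return uv, in_front
```

Points behind the camera are flagged by the returned mask rather than dropped, so the caller can keep the output aligned with the input BEV points. Because the projection depends only on the camera parameters, the same routine can be applied to a viewpoint that was never seen during training, which is the intuition behind the generalizability claim above.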

Paper Structure

This paper contains 10 sections, 2 equations, 9 figures, 1 table.

Figures (9)

  • Figure 2: MVPbev consists of two stages. The first stage projects BEV semantics to perspective views using camera parameters to maintain global semantic consistency. The second stage takes both the perspective semantics and text prompts as input, and generates multi-view images with visual consistency and test-time instance-level control by explicitly enforcing consistency in latent space.
  • Figure 3: We visualize our BEV projection process. Given a BEV semantic map $B$, we project it to multiple perspective views. We overlay the semantics on the original RGB images in perspective view for better comparison.
  • Figure 4: Our multi-view attention module implicitly exploits cross-view consistency by aggregating information from the target feature pixels in neighbouring views.
  • Figure 5: We explicitly enforce that the noise values of pixels in overlapping FOVs are consistent across views.
  • Figure 6: MVPbev produces the most visually and semantically consistent images w.r.t. the input control signal. We highlight the overlapping areas with orange bounding boxes and the road boundaries with green lines.
  • ...and 4 more figures
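
The noise-consistency idea in the Figure 5 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: the function name `share_noise`, the `(C, H, W)` latent layout, the homography direction (view B pixels to view A pixels), and the nearest-neighbour lookup are assumptions made for this example.

```python
import numpy as np

def share_noise(noise_a, noise_b, H_b_to_a):
    """Make view B reuse view A's noise wherever the two views overlap.

    noise_a, noise_b: (C, H, W) latent noise maps for two neighbouring views.
    H_b_to_a:         (3, 3) homography mapping homogeneous (x, y, 1) pixel
                      coordinates of view B into view A (assumed convention).
    Returns the updated noise for view B and an (H, W) bool overlap mask.
    """
    c, h, w = noise_b.shape
    _, ha, wa = noise_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts_b = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=0)
    # Warp every pixel of B into A's pixel grid.
    pts_a = H_b_to_a @ pts_b
    scale = pts_a[2]
    valid = scale > 1e-8
    safe_scale = np.where(valid, scale, 1.0)
    xa = np.rint(pts_a[0] / safe_scale).astype(int)
    ya = np.rint(pts_a[1] / safe_scale).astype(int)
    # A pixel of B overlaps A if its warped location falls inside A.
    inside = valid & (xa >= 0) & (xa < wa) & (ya >= 0) & (ya < ha)
    flat = noise_b.reshape(c, -1).copy()
    # Nearest-neighbour copy of A's noise into the overlapping pixels of B.
    flat[:, inside] = noise_a[:, ya[inside], xa[inside]]
    return flat.reshape(c, h, w), inside.reshape(h, w)
```

Nearest-neighbour copying is chosen here because each copied value remains a sample from the original Gaussian; bilinear interpolation would average neighbouring samples and reduce the noise variance.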