From Bird's-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model

Xiaojie Xu, Tianshuo Xu, Fulong Ma, Yingcong Chen

TL;DR

A practical framework for generating images from a BEV layout that leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street view images.

Abstract

We explore Bird's-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine-tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This fine-tuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street-view images.

Paper Structure

This paper contains 12 sections, 4 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: From a bird's-eye view semantic map, our framework is capable of generating high-quality and varied camera view images. In terms of map elements, our results closely match the ground truth images. The red boxes (seen in the left four images) represent vehicles, while the yellow lines (in the right two images) delineate road contours.
  • Figure 2: Our two-stage pipeline. Initially, a BEV map is projected and refined to produce semantic maps from the camera's perspective. These semantic maps, paired with the prompt, are then fed into a pretrained U-Net for iterative denoising. We have incorporated street-view adaptation layers into the network to ensure style and viewpoint alignment.
  • Figure 3: The impact of shape refinement on the final image generation is evident. Without refinement, the resulting image (left) resembles a cube. In contrast, the refined version (right) exhibits a more natural form.
  • Figure 4: We incorporate viewpoints into our foundational diffusion model by integrating specific views into the text prompts, resulting in distinct View Adaptation Layers. During sampling from the model, we can generate images from a designated camera by invoking its learned novel prompt.
  • Figure 5: We compare our method (left) with UViT (middle) [bao2023all] and BevGen (right) [swerdlow2023street]. Our results demonstrate greater stability and more effective use of conditional information, especially in the highlighted yellow regions where the condition should take effect. For best results, it is recommended to zoom in.
  • ...and 2 more figures
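
The Figure 2 caption says that a BEV map is first projected into the camera's perspective and then refined. As a rough illustration of the projection half only, the sketch below maps ground-plane BEV points to pixel coordinates with a plain pinhole model. It is not the authors' implementation: the function name `project_bev_points`, the flat-ground assumption, the horizontally looking camera, and all numeric values (camera height, focal length, principal point) are hypothetical choices made for this example.

```python
import math


def project_bev_points(points_xy, cam_height, focal, cx, cy, yaw=0.0):
    """Project ground-plane BEV points into pixel coordinates.

    Assumptions (illustrative, not from the paper):
      * BEV points are (x, y) in metres, x forward and y to the left of the ego vehicle.
      * The ground is flat and the camera sits `cam_height` metres above it.
      * The camera looks horizontally, rotated by `yaw` radians about the vertical axis.
    Returns a list with one (u, v) tuple per point, or None if the point is
    behind the camera.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    pixels = []
    for x, y in points_xy:
        # Express the point in the camera's heading frame.
        forward = cos_y * x + sin_y * y   # distance along the optical axis
        left = -sin_y * x + cos_y * y     # lateral offset, positive to the left
        if forward <= 1e-6:
            pixels.append(None)           # behind (or at) the camera plane
            continue
        # Camera convention: X right, Y down, Z forward.
        cam_x, cam_y, cam_z = -left, cam_height, forward
        u = focal * cam_x / cam_z + cx
        v = focal * cam_y / cam_z + cy
        pixels.append((u, v))
    return pixels


if __name__ == "__main__":
    # Footprint of a hypothetical 4 m x 2 m vehicle, 10 m ahead of the camera.
    footprint = [(10.0, 1.0), (10.0, -1.0), (14.0, -1.0), (14.0, 1.0)]
    for corner, pixel in zip(footprint, project_bev_points(
            footprint, cam_height=1.5, focal=1000.0, cx=800.0, cy=450.0)):
        print(corner, "->", pixel)
```

A projection of this kind only places flat footprints in the image. The learned shape refinement that the Figure 2 and Figure 3 captions describe is a separate step and is not reproduced here.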