DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition

Ji Li, Zhiwei Li, Shihao Li, Zhenjiang Yu, Boyang Wang, Haiou Liu

TL;DR

DiffPlace addresses the need for background-consistent, place-aware street-view generation to improve visual place recognition. It introduces a place-ID controller that maps place-ID embeddings into the CLIP space via linear projection, a perceiver transformer, and a SoftCLIP contrastive loss to enable place-controllable multi-view synthesis. The framework preserves scene background while allowing foreground and weather variations, and its synthetic data substantially improves place recognition and 3D object detection on nuScenes and Pitts30k. This work demonstrates the value of integrating place-conditioned diffusion with contrastive alignment to enhance autonomous driving perception systems.

Abstract

Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs a linear projection, a perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.
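
As a rough illustration of the mapping described in the abstract, the sketch below projects place-ID embeddings to a CLIP-sized width with a linear layer and scores them against CLIP image features using a standard symmetric CLIP-style contrastive loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the loss shown is the plain InfoNCE form rather than the paper's SoftCLIP objective, the perceiver transformer step is omitted, and all function names, dimensions, and the temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def project_place_id(place_emb, weight, bias):
    # Hypothetical linear projection from place-ID width to CLIP width.
    return place_emb @ weight + bias

def symmetric_contrastive_loss(place_tokens, clip_feats, temperature=0.07):
    # Plain CLIP-style loss (an assumption; the paper uses SoftCLIP).
    # Row i of place_tokens is treated as the positive for row i of clip_feats.
    p = l2_normalize(place_tokens)
    c = l2_normalize(clip_feats)
    logits = p @ c.T / temperature
    idx = np.arange(len(logits))
    loss_p2c = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_c2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2c + loss_c2p)

rng = np.random.default_rng(0)
batch, place_dim, clip_dim = 4, 32, 16  # hypothetical sizes
place_emb = rng.normal(size=(batch, place_dim))
weight = rng.normal(size=(place_dim, clip_dim)) * 0.1
bias = np.zeros(clip_dim)
clip_feats = rng.normal(size=(batch, clip_dim))

projected = project_place_id(place_emb, weight, bias)
loss_random = symmetric_contrastive_loss(projected, clip_feats)

# Sanity checks: perfectly aligned orthogonal pairs give a near-zero loss,
# while fully collapsed (identical) features give log(batch).
orthogonal = np.eye(batch, clip_dim)
loss_aligned = symmetric_contrastive_loss(orthogonal, orthogonal)
collapsed = np.ones((batch, clip_dim))
loss_collapsed = symmetric_contrastive_loss(collapsed, collapsed)
```

The two sanity checks bracket the behavior one would expect from any such alignment loss: it approaches zero when each place token matches only its own CLIP feature, and it equals the log of the batch size when the features carry no distinguishing information.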

Paper Structure

This paper contains 23 sections, 9 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Systematic depiction of the proposed DiffPlace. (a) Generation is viewed as the reverse process of perception: it produces images from inputs such as place-ID embeddings and bounding boxes. We are the first to achieve a closed data loop for augmented training of visual place recognition. (b) The original place recognition model mistakenly focused on foreground objects and clouds in the sky. We corrected it through augmented training on "car left" and "weather changed" situations generated by DiffPlace.
  • Figure 2: Overview of the DiffPlace pipeline. The input scene representation $S = \{ \textit{MAP, BOX, TEXT, PLACE\_ID} \}$ is processed by dedicated encoders: $E_{\text{map}}$, $E_{\text{text}}$, $E_{\text{place}}$, $E_{\text{cam}}$, and $E_{\text{box}}$. The resulting encoded features are concatenated and fed into the U-Net via cross-attention mechanisms to generate multi-view consistent images with controllable background and foreground elements. An optional visual place recognition network (MixVPR) is utilized to extract place-ID embeddings, enabling enhanced place-aware synthesis.
  • Figure 3: Details of the proposed place-ID controller. (a) Place-ID embeddings are projected via trainable linear layers to align with other conditions; (b) an attribute perceiver transformer lets the place-ID embeddings $Z$ interact with CLIP image features $c_I$; (c) a contrastive loss $\mathcal{L}_{\text{SoftCLIP}}$ is applied to align place-ID embeddings with the CLIP latent space.
  • Figure 4: Realism and controllability validation. Our method demonstrates significantly better control over place features, particularly in the background, compared to BEVGen, MagicDrive, and DualDiff. For comparison, we highlight background areas that are low-quality (yellow) or that failed to generate (red). All scenes are from the nuScenes validation set.
  • Figure 6: Visualization of the cross-attention contributed by place-ID embeddings during the generation process.
  • ...and 7 more figures
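
The conditioning scheme summarized in the Figure 2 caption, where encoded conditions are concatenated and consumed by the U-Net through cross-attention, can be sketched as follows. This is only an illustrative single-head version under assumed shapes: the token counts per condition, the feature width, the random weights, and the helper names are all hypothetical and do not come from the paper. The last lines compute how much attention each latent token assigns to the place-ID tokens, which is the kind of quantity a cross-attention visualization such as Figure 6 would display.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: latent tokens query the condition tokens.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
dim = 8  # hypothetical shared feature width

# Hypothetical encoder outputs, one token sequence per condition.
cond = {
    "map": rng.normal(size=(1, dim)),
    "text": rng.normal(size=(5, dim)),
    "place": rng.normal(size=(4, dim)),
    "cam": rng.normal(size=(1, dim)),
    "box": rng.normal(size=(3, dim)),
}

# Concatenate all condition tokens along the sequence axis.
context = np.concatenate(list(cond.values()), axis=0)

latent_tokens = rng.normal(size=(6, dim))  # stand-in for U-Net features
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

out, attn = cross_attention(latent_tokens, context, w_q, w_k, w_v)

# Attention mass that each latent token places on the place-ID tokens.
start = cond["map"].shape[0] + cond["text"].shape[0]
stop = start + cond["place"].shape[0]
place_mass = attn[:, start:stop].sum(axis=1)
```

Because every condition lives in one concatenated sequence, the contribution of any single condition can be read off by slicing the attention matrix over its token range, as done for `place_mass` here.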