Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis

Tao Jun Lin, Wenqing Wang, Yujiao Shi, Akhil Perincherry, Ankit Vora, Hongdong Li

TL;DR

This work tackles cross-view image synthesis between satellite and ground perspectives, a fundamentally ill-posed, one-to-many problem. It introduces Geometry-guided Cross-view Conditioning (GCC) within a Latent Diffusion Model to encode explicit geometric correspondences via Cross-View Geometry Projection and multi-level feature aggregation, enabling diverse and geometrically consistent Sat2Grd and Grd2Sat outputs. The approach achieves state-of-the-art performance on KITTI, CVUSA, and CVACT datasets, demonstrating superior fidelity and diversity while avoiding reliance on extra geometry maps. The method has practical implications for data augmentation, virtual reality, and cross-view localization, with further potential from incorporating additional modalities and multi-dataset learning.
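As a rough illustration of what a geometric correspondence between the two views can look like, the sketch below maps each pixel of a ground-level panorama to a location in an overhead image. The summary above does not give the projection formula, so everything here is an assumption made for illustration: an equirectangular panorama, a flat ground plane, a camera located at the centre of the satellite patch, north at azimuth zero, and the hypothetical names `ground_to_satellite_coords`, `meters_per_pixel`, and `camera_height`. It should not be read as the paper's Cross-View Geometry Projection.

```python
import numpy as np

def ground_to_satellite_coords(pano_h, pano_w, sat_size,
                               meters_per_pixel, camera_height=1.65):
    """Map panorama pixels to satellite pixel coordinates (flat-ground assumption).

    Returns (u, v, valid): float column/row coordinates in the satellite image
    and a boolean mask of panorama pixels that land inside it.
    """
    # Pixel centres -> viewing angles. Elevation runs from +pi/2 (top row)
    # to -pi/2 (bottom row); azimuth runs clockwise from north over [0, 2*pi).
    rows = (np.arange(pano_h) + 0.5) / pano_h
    cols = (np.arange(pano_w) + 0.5) / pano_w
    elev, azim = np.meshgrid(np.pi / 2 - rows * np.pi,
                             cols * 2 * np.pi, indexing="ij")

    # Only rays pointing below the horizon can hit the ground plane.
    below = elev < 0
    # Sky pixels get a placeholder angle; they are masked out in `valid`.
    depression = np.where(below, -elev, np.pi / 4)
    dist_m = camera_height / np.tan(depression)  # ground distance in metres

    centre = (sat_size - 1) / 2.0
    u = centre + dist_m * np.sin(azim) / meters_per_pixel  # east is +column
    v = centre - dist_m * np.cos(azim) / meters_per_pixel  # north is -row
    inside = (u >= 0) & (u <= sat_size - 1) & (v >= 0) & (v <= sat_size - 1)
    return u, v, below & inside

# Usage: look up, for every ground pixel, the nearest satellite feature.
u, v, valid = ground_to_satellite_coords(64, 256, sat_size=128, meters_per_pixel=0.2)
sat_feat = np.random.default_rng(0).standard_normal((128, 128, 8))
ui = np.clip(np.rint(u), 0, 127).astype(int)
vi = np.clip(np.rint(v), 0, 127).astype(int)
projected = np.where(valid[..., None], sat_feat[vi, ui], 0.0)  # shape (64, 256, 8)
```

A flat ground plane only explains pixels below the horizon, which is one reason a learned generative model is still needed to fill in sky, buildings, and other content that has no direct overhead correspondence.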

Abstract

This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery or vice versa. We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively. Unlike previous works that typically focus on one-to-one generation, producing a single output image from a single input image, our approach acknowledges the inherent one-to-many nature of the problem. This recognition stems from the challenges posed by differences in illumination, weather conditions, and occlusions between the two views. To effectively model this uncertainty, we leverage recent advancements in diffusion models. Specifically, we exploit random Gaussian noise to represent the diverse possibilities learnt from the target view data. We introduce a Geometry-guided Cross-view Condition (GCC) strategy to establish explicit geometric correspondences between satellite and street-view features. This enables us to resolve the geometry ambiguity introduced by camera pose between image pairs, boosting the performance of cross-view image synthesis. Through extensive quantitative and qualitative analyses on three benchmark cross-view datasets, we demonstrate the superiority of our proposed geometry-guided cross-view condition over baseline methods, including recent state-of-the-art approaches in cross-view image synthesis. Our method generates images of higher quality, fidelity, and diversity than other state-of-the-art approaches.
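The one-to-many idea can be shown with a toy sampler: the conditioning input is held fixed while the initial Gaussian noise changes, and each noise draw leads to a different output. The update rule below is a hypothetical stand-in that simply pulls the sample part-way toward the condition at every step; it is not a trained diffusion model and not the paper's sampler, and the names `toy_sample`, `steps`, and `pull` are illustrative only.

```python
import numpy as np

def toy_sample(condition, seed, steps=10, pull=0.2):
    """Start from seeded Gaussian noise and move step by step toward the condition."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(condition.shape)  # the only source of randomness
    for _ in range(steps):
        # Stand-in for one conditional denoising step (assumption, not a real model).
        x = x + pull * (condition - x)
    return x

condition = np.zeros((4, 4))
samples = [toy_sample(condition, seed) for seed in range(3)]
# Same condition, different seeds: the outputs differ from one another,
# while repeating a seed reproduces its sample exactly.
```

In a real latent diffusion pipeline the denoising step is a learned network, but the source of diversity is the same: the random starting latent, not the conditioning image.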

Paper Structure

This paper contains 33 sections, 12 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Our proposed Geometry-Guided Conditioning method for cross-view image synthesis (a) and visualization examples generated by our proposed Geometry-guided Cross-view Diffusion. On the bottom left (b) are images generated by our Sat2Grd model; on the bottom right (c) are images generated by our Grd2Sat model.
  • Figure 2: An overview of the proposed Cross-view Image Synthesis Pipeline. Given either a satellite image patch or a street-view image, the model employs a feature extractor $\mathcal{F}$ and our Geometry Projection Module to construct our Geometry-guided Cross-view Conditions (GCC). The Latent Diffusion Pipeline learns to model the cross-view data distribution from a Gaussian noise latent, under the guidance of our proposed GCC. The ControlNet module takes GCC as input and fine-tunes LoRA layers.
  • Figure 3: Ablation on the Sat2Grd task: samples generated by our LDM model and our ControlNet model given the same condition.
  • Figure 4: Examples of images generated by different methods in the Sat2Grd image synthesis task, on the CVACT (Aligned) and CVUSA datasets.
  • Figure 5: Examples of images generated by different methods in the Grd2Sat image synthesis task, on the CVACT (Aligned) dataset.
  • ...and 4 more figures