Geometry-guided Cross-view Diffusion for One-to-many Cross-view Image Synthesis
Tao Jun Lin, Wenqing Wang, Yujiao Shi, Akhil Perincherry, Ankit Vora, Hongdong Li
TL;DR
This work tackles cross-view image synthesis between satellite and ground perspectives, a fundamentally ill-posed, one-to-many problem. It introduces a Geometry-guided Cross-view Condition (GCC) within a Latent Diffusion Model that encodes explicit geometric correspondences via Cross-View Geometry Projection and multi-level feature aggregation, enabling diverse and geometrically consistent Sat2Grd and Grd2Sat outputs. The approach achieves state-of-the-art performance on the KITTI, CVUSA, and CVACT datasets, with higher fidelity and diversity than prior methods and without relying on extra geometry maps. The method has practical implications for data augmentation, virtual reality, and cross-view localization, and could benefit further from additional modalities and multi-dataset learning.
Abstract
This paper presents a novel approach for cross-view synthesis aimed at generating plausible ground-level images from corresponding satellite imagery, or vice versa. We refer to these tasks as satellite-to-ground (Sat2Grd) and ground-to-satellite (Grd2Sat) synthesis, respectively. Unlike previous works that typically focus on one-to-one generation, producing a single output image from a single input image, our approach acknowledges the inherent one-to-many nature of the problem: a single input view is consistent with many plausible target views because illumination, weather conditions, and occlusions differ between the two views. To model this uncertainty effectively, we leverage recent advancements in diffusion models. Specifically, we exploit random Gaussian noise to represent the diverse possibilities learnt from the target view data. We introduce a Geometry-guided Cross-view Condition (GCC) strategy to establish explicit geometric correspondences between satellite and street-view features. This enables us to resolve the geometric ambiguity introduced by the relative camera pose between image pairs, boosting the performance of cross-view image synthesis. Through extensive quantitative and qualitative analyses on three benchmark cross-view datasets, we demonstrate the superiority of our proposed geometry-guided cross-view condition over baseline methods, including recent state-of-the-art approaches in cross-view image synthesis. Our method generates images of higher quality, fidelity, and diversity than these approaches.
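The one-to-many idea in the abstract, where a fixed conditioning input combined with different random Gaussian noise draws yields different outputs, can be illustrated with a toy sketch. The code below is not the authors' model. It assumes a standard DDPM-style ancestral sampler and replaces the learned denoising network with a closed-form noise predictor for a hypothetical Gaussian target whose mean is the condition vector. All names (`make_schedule`, `toy_noise_predictor`, `sample`) and all parameter values are illustrative assumptions.

```python
import math
import random


def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule and its cumulative products (alpha-bar)."""
    betas = [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]
    alpha_bars = []
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def toy_noise_predictor(x_t, condition, alpha_bar, spread):
    """Stand-in for a learned conditional denoiser (an assumption).

    If the clean target is Gaussian with mean `condition` and standard
    deviation `spread`, the expected noise given x_t has this closed form.
    """
    var_t = alpha_bar * spread ** 2 + (1.0 - alpha_bar)
    scale = math.sqrt(1.0 - alpha_bar) / var_t
    return [
        scale * (x - math.sqrt(alpha_bar) * c)
        for x, c in zip(x_t, condition)
    ]


def sample(condition, seed, spread=0.2, num_steps=200):
    """Run DDPM-style ancestral sampling for one noise seed."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise; this draw is what makes outputs differ.
    x = [rng.gauss(0.0, 1.0) for _ in condition]
    for t in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(x, condition, alpha_bars[t], spread)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [
            (xi - coef * e) / math.sqrt(1.0 - betas[t])
            for xi, e in zip(x, eps_hat)
        ]
        if t > 0:
            # Inject fresh noise at every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


# Hypothetical condition vector standing in for cross-view features.
condition = [0.0, 0.5, 1.0]
for seed in range(3):
    print(seed, [round(v, 3) for v in sample(condition, seed)])
```

Calling `sample` with the same condition and different seeds gives different outputs, while reusing a seed reproduces the same output. In the paper's setting the condition would instead be the geometry-guided cross-view features, and the noise predictor would be the trained diffusion network.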
