Retrieval-guided Cross-view Image Synthesis
Hongji Yang, Yiru Li, Yingying Zhu
TL;DR
This work tackles cross-view image synthesis with a retrieval-guided framework that replaces semantic maps and preprocessing modules with view-invariant semantics learned by a contrastively trained embedder. A two-network system maps noise and source-view information into a shared embedding space, then fuses the retrieved semantics with style in a two-stage generator that includes Attentional AdaIN and region-aware modulation. The authors also introduce VIGOR-GEN, an urban-focused dataset, and report state-of-the-art results on CVUSA, CVACT, and VIGOR-GEN across multiple metrics, with gains in realism and cross-view consistency. The approach couples information retrieval with image synthesis to enable efficient, high-fidelity cross-view generation for real-world urban scenarios and retrieval-based applications.
Abstract
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential for guiding synthesis tasks, particularly cross-view image synthesis, remains underexplored. Cross-view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval-guided framework is trained through contrastive learning to capture semantic similarities across different viewpoints and to create a smooth embedding space. Furthermore, a novel fusion mechanism leverages these embeddings to guide image synthesis while learning and encoding both view-invariant and view-specific features. To further advance this area, we introduce VIGOR-GEN, a new urban-focused dataset with complex viewpoint variations in real-world scenarios. Extensive experiments demonstrate that our retrieval-guided approach significantly outperforms existing methods on the CVUSA, CVACT, and VIGOR-GEN datasets, particularly in retrieval accuracy (R@1) and synthesis quality (FID). Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
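As a rough illustration of the two retrieval ideas the abstract names, contrastive training of a shared embedding space and the R@1 metric, the following minimal sketch may help. It is not the authors' implementation: the abstract does not specify the loss, so an InfoNCE-style objective is assumed here, and the function names, the temperature value, and the toy 2-D embeddings are all hypothetical. Pure Python is used only to keep the example self-contained.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def similarity_matrix(queries, gallery):
    """Cosine similarity between every query and every gallery embedding."""
    q = [l2_normalize(v) for v in queries]
    g = [l2_normalize(v) for v in gallery]
    return [[sum(a * b for a, b in zip(qi, gj)) for gj in g] for qi in q]

def info_nce(sim, temperature=0.1):
    """Assumed InfoNCE-style loss: row i should score highest at column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def recall_at_1(sim):
    """Fraction of queries whose top-ranked gallery item is the true match."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)

# Hypothetical toy embeddings: ground_view[i] and aerial_view[i] form a pair.
ground_view = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aerial_view = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.3]]

sim = similarity_matrix(ground_view, aerial_view)
loss = info_nce(sim)
r_at_1 = recall_at_1(sim)
```

In this sketch, a lower loss means matched cross-view pairs sit closer together than mismatched ones, and R@1 counts how often the nearest gallery embedding is the correct counterpart. How the paper evaluates R@1 on synthesized images is not described in the abstract, so the pairing used here is only an assumption.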
