Spherical Dense Text-to-Image Synthesis
Timon Winter, Stanislav Frolov, Brian Bernhard Moser, Andreas Dengel
TL;DR
This work addresses the lack of a unified framework for spherical dense text-to-image (SDT2I) synthesis by combining dense text prompts with 360° panoramas. It presents two plug-and-play methods, MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF), which fuse training-free dense diffusion with the panoramic backbones StitchDiffusion and PanFusion, and introduces Dense-Synthetic-View (DSynView) to benchmark SDT2I. The study finds that MSTD generally matches baselines in image quality and layout adherence, while MPF yields more diverse scenes but introduces foreground artifacts, which motivates two improvements: bootstrap-coupling and disabling equirectangular perspective-projection attention (EPPA) in the foreground. The results underscore the trade-off between fidelity and diversity in SDT2I and highlight practical tuning strategies for better prompt and layout adherence in spherical spaces, with DSynView providing a benchmark for future work.
Abstract
Recent advancements in text-to-image (T2I) synthesis have improved results, but challenges remain in layout control and in generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues separately, but so far no unified approach exists. Trivial approaches, such as prompting a DT2I model to generate panoramas, cannot produce proper spherical distortions or seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) synthesis can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts, to evaluate our models. Our results show that MSTD outperforms MPF in image quality as well as in prompt and layout adherence. MPF generates more diverse images but struggles to synthesize flawless foreground objects. To improve MPF, we propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground. Code: https://github.com/sdt2i/spherical-dense-text-to-image
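To illustrate the kind of training-free fusion that MultiDiffusion-style dense synthesis relies on, the following is a minimal sketch and not the authors' implementation. It assumes each region prompt yields its own denoised prediction at a given step, and that the predictions are merged by a mask-weighted average. The function name `fuse_region_predictions`, the plain nested-list grids, and the toy values are all hypothetical. Real pipelines operate on latent tensors, and the spherical methods additionally handle equirectangular distortion and border seams, which this sketch omits.

```python
def fuse_region_predictions(predictions, masks):
    """Mask-weighted average of per-region predictions (MultiDiffusion-style sketch).

    predictions: list of H x W grids (lists of lists of floats), one per prompt.
    masks: list of H x W grids with 0/1 entries, one per prompt.
    Assumption: every pixel is covered by at least one mask, e.g. because a
    background mask fills whatever the foreground masks leave uncovered.
    """
    height, width = len(masks[0]), len(masks[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Number of regions that claim this pixel.
            weight = sum(m[y][x] for m in masks)
            if weight == 0:
                raise ValueError(f"pixel ({y}, {x}) is not covered by any mask")
            # Sum of the predictions from the regions that claim this pixel.
            total = sum(m[y][x] * p[y][x] for p, m in zip(predictions, masks))
            fused[y][x] = total / weight
    return fused


# Toy 1 x 3 layout: a background region and two partly overlapping foreground regions.
masks = [
    [[1, 0, 0]],  # background: only the left pixel
    [[0, 1, 1]],  # foreground A: middle and right pixels
    [[0, 0, 1]],  # foreground B: right pixel only
]
predictions = [
    [[0.0, 0.0, 0.0]],  # background prediction
    [[2.0, 2.0, 2.0]],  # foreground A prediction
    [[4.0, 4.0, 4.0]],  # foreground B prediction
]
fused = fuse_region_predictions(predictions, masks)
print(fused)  # [[0.0, 2.0, 3.0]]
```

Where a single mask covers a pixel, that region's prediction is used unchanged. Where masks overlap, the predictions are averaged, which is what keeps adjacent regions consistent with each other.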
