Spherical Dense Text-to-Image Synthesis

Timon Winter, Stanislav Frolov, Brian Bernhard Moser, Andreas Dengel

TL;DR

This work addresses the lack of a unified framework for spherical dense text-to-image synthesis (SDT2I) by combining dense, region-level text prompts with 360° panoramas. It presents two plug-and-play methods, MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF), which fuse training-free dense diffusion with the panoramic backbones StitchDiffusion and PanFusion, and introduces Dense-Synthetic-View (DSynView) to benchmark SDT2I. The study finds that MSTD generally matches baselines in image quality and layout adherence, while MPF yields more diverse scenes but introduces foreground artifacts, which motivates bootstrap-coupling and disabling equirectangular perspective-projection attention (EPPA) in the foreground. The results underscore a trade-off between fidelity and diversity in SDT2I and highlight practical tuning strategies for better prompt and layout adherence on the sphere, with DSynView providing an evaluation benchmark for future work.

Abstract

Recent advancements in text-to-image (T2I) synthesis have improved results, but challenges remain in layout control and in generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues individually, but so far no unified approach exists. Trivial approaches, such as prompting a DT2I model to generate panoramas, cannot produce proper spherical distortions or seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) synthesis can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts, to evaluate our models. Our results show that MSTD outperforms MPF in image quality as well as in prompt and layout adherence. MPF generates more diverse images but struggles to synthesize flawless foreground objects. As an improvement to MPF, we propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground. Code: https://github.com/sdt2i/spherical-dense-text-to-image
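
To give a rough idea of what integrating a training-free DT2I approach can involve, the sketch below illustrates a MultiDiffusion-style fusion step: each prompt yields its own denoised latent, and the latents are combined per pixel by a mask-weighted average. It is a minimal illustration under stated assumptions, not the authors' implementation. The function name `fuse_region_latents`, the list-of-lists latent representation, and the choice of a background mask that is the complement of the foreground mask are hypothetical and chosen for readability only. Spherical handling (stitching at the borders, pole distortions) is deliberately left out.

```python
from typing import List, Sequence

Grid = List[List[float]]  # one latent channel or one mask, indexed as [row][column]


def fuse_region_latents(latents: Sequence[Grid], masks: Sequence[Grid],
                        eps: float = 1e-8) -> Grid:
    """Mask-weighted per-pixel average of per-prompt latents.

    Assumption: latents[i] was denoised with prompt i, and masks[i] marks
    where prompt i should apply (values in [0, 1]).
    """
    height, width = len(latents[0]), len(latents[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weight_sum = sum(m[y][x] for m in masks)
            value_sum = sum(m[y][x] * z[y][x] for z, m in zip(latents, masks))
            # eps guards against pixels that no mask covers.
            fused[y][x] = value_sum / max(weight_sum, eps)
    return fused


# Toy 2x4 example: a foreground prompt owns the right half of the image.
fg_mask = [[0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0]]
# Hypothetical convention: the background mask is the complement of the foreground.
bg_mask = [[1.0 - v for v in row] for row in fg_mask]

bg_latent = [[0.0] * 4 for _ in range(2)]  # stand-in for the background prediction
fg_latent = [[4.0] * 4 for _ in range(2)]  # stand-in for the foreground prediction

fused = fuse_region_latents([bg_latent, fg_latent], [bg_mask, fg_mask])
print(fused)
```

In this toy setup the fused latent takes the background value in the left half and the foreground value in the right half. Where masks overlap, the same formula averages the overlapping predictions.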

Paper Structure

This paper contains 17 sections, 1 equation, 22 figures, 9 tables.

Figures (22)

  • Figure 1: Task comparison: Traditional text-to-image (top-left) generates images based on a single global prompt. Dense text-to-image (top-right) introduces masks to control the spatial layout. Spherical text-to-image (bottom-left) synthesizes 360×180-degree panoramas with a seamless transition across the border and distortions at the poles. In this work, we integrate both approaches to enable controllable spherical dense text-to-image synthesis (bottom-right).
  • Figure 2: Influence of bootstrapping on our metrics for each approach, showing a functional relationship that is non-monotonic for FID.
  • Figure 3: Prompt and mask adherence of MultiStitchDiffusion. The IoU is high, but some objects, e.g. the building or the wardrobe, do not blend well into the background.
  • Figure 4: Prompt and mask adherence of MultiPanFusion, which is satisfactory in some examples. The middle-left image, however, shows a failure case.
  • Figure 5: Various failure cases of MultiPanFusion, showing noise around the objects, ill-fitting ceilings, wrong objects, and foreground elements merging with the background.
  • ...and 17 more figures