Taming Stable Diffusion for Text to 360° Panorama Image Generation

Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xiaoshui Huang, Dinh Phung, Wanli Ouyang, Jianfei Cai

TL;DR

This paper introduces a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt and proposes a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process.

Abstract

Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.
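
The gap between panorama and perspective images noted above is, geometrically, a change of projection: a 360-degree panorama is usually stored in equirectangular form (columns are longitude, rows are latitude), while Stable Diffusion is trained on pinhole-camera views. The sketch below is a minimal NumPy illustration of that mapping and is not taken from the PanFusion code. The function names, the axis convention (x right, y down, z forward), and the nearest-neighbour sampling are all assumptions made for illustration.

```python
import numpy as np


def view_to_pano_coords(fov_deg, yaw_deg, pitch_deg, out_h, out_w, pano_h, pano_w):
    """Map each pixel of a pinhole view to float (row, col) panorama coordinates.

    Hypothetical helper. Positive yaw turns the camera right, positive pitch
    tilts it up.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)
    # Pixel grid centred on the principal point.
    xs = np.arange(out_w) - (out_w - 1) / 2
    ys = np.arange(out_h) - (out_h - 1) / 2
    x, y = np.meshgrid(xs, ys)
    # Unit ray directions in camera space.
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays: pitch about the x axis first, then yaw about the y axis.
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(p), -np.sin(p)],
                      [0, np.sin(p), np.cos(p)]])
    rot_y = np.array([[np.cos(t), 0, np.sin(t)],
                      [0, 1, 0],
                      [-np.sin(t), 0, np.cos(t)]])
    dirs = dirs @ (rot_y @ rot_x).T
    # Ray direction -> longitude in [-pi, pi] and latitude in [-pi/2, pi/2].
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    # Angles -> pixel coordinates (pixel centres at integer indices).
    col = (lon / (2 * np.pi) + 0.5) * pano_w - 0.5
    row = (lat / np.pi + 0.5) * pano_h - 0.5
    return row, col


def sample_perspective(pano, fov_deg, yaw_deg, pitch_deg, out_h, out_w):
    """Nearest-neighbour lookup of a perspective view from a panorama array."""
    pano_h, pano_w = pano.shape[:2]
    row, col = view_to_pano_coords(
        fov_deg, yaw_deg, pitch_deg, out_h, out_w, pano_h, pano_w
    )
    r = np.clip(np.rint(row).astype(int), 0, pano_h - 1)
    # Longitude is periodic, so columns wrap around the left/right seam.
    c = np.rint(col).astype(int) % pano_w
    return pano[r, c]
```

A view rendered at `yaw_deg=180` straddles the left and right borders of the panorama, which is why the column index is wrapped with a modulo rather than clipped. A real pipeline would more likely use bilinear interpolation than the nearest-neighbour lookup shown here.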

Paper Structure (22 sections, 5 equations, 25 figures, 7 tables)

This paper contains 22 sections, 5 equations, 25 figures, 7 tables.

Figures (25)

  • Figure 1: Our PanFusion generates realistic and consistent panoramas with a $360^\circ$ horizontal by $180^\circ$ vertical FOV from a single text prompt, in contrast to the limited FOV of the current state-of-the-art method, MVDiffusion [tang2023mvdiffusion]. Left: PanFusion avoids the repetitive elements (duplicated "ceiling fans") and the inconsistency (the ceiling and wall in the center) seen in MVDiffusion's output. Right: Although trained mostly on indoor scenes, PanFusion generalizes well to out-of-domain outdoor prompts and produces more reasonable layouts.
  • Figure 2: Our proposed dual-branch PanFusion pipeline. The panorama branch (upper) provides global layout guidance and registers the perspective information to produce a seamless panorama. The perspective branch (lower) harnesses the rich prior knowledge of Stable Diffusion (SD) and provides guidance to alleviate distortion under perspective projection. Both branches use the same UNet backbone with shared weights but are finetuned with separate LoRA layers. Equirectangular-Perspective Projection Attention (EPPA) modules are plugged into different layers of the UNet to pass information between the two branches.
  • Figure 3: Equirectangular-Perspective Projection (EPP) Attention. Because the EPP attention module is designed to be bijective and passes information in both directions, we illustrate only the direction that registers perspective information to the panorama.
  • Figure 4: Qualitative comparisons of text-conditioned panorama generation. We show panoramas cropped to the vertical FoV of MVDiffusion [tang2023mvdiffusion]. Below each panorama, we show 4 evenly spaced perspective projections, with the first view crossing the left and right boundaries. Color-coded boxes highlight the loop inconsistency, distorted lines, repetitive objects, and unreasonable furniture layouts of the baseline methods, all of which our method addresses. More results can be found in the supplementary material.
  • Figure 5: Ablation study. Artifacts are highlighted with red boxes and projected to perspective views. Prompt: "A hallway in a hotel."
  • ...and 20 more figures
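
The Figure 2 caption describes two branches that share one backbone's weights while each is finetuned with its own LoRA layers. The sketch below shows that arrangement for a single linear layer in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoRALinear`, the rank, and the scaling `alpha / rank` are hypothetical choices, and a real UNet would apply this to many layers.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x @ W + scale * (x @ down @ up).

    W is a shared, frozen base weight. Only `down` and `up` belong to this
    branch. Hypothetical class for illustration.
    """

    def __init__(self, shared_weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = shared_weight.shape
        self.weight = shared_weight  # a reference, not a copy
        self.down = rng.normal(0.0, 0.01, size=(d_in, rank))
        # Zero-initialised up-projection: the adapter starts as a no-op.
        self.up = np.zeros((rank, d_out))
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight + self.scale * (x @ self.down @ self.up)


# One frozen base weight, two branch-specific adapters.
base = np.random.default_rng(42).normal(size=(8, 8))
pano_layer = LoRALinear(base, seed=1)
pers_layer = LoRALinear(base, seed=2)

x = np.ones((1, 8))
same_at_init = np.allclose(pano_layer(x), pers_layer(x))

# Updating one branch's adapter leaves the shared weight and the other branch untouched.
pano_layer.up += 0.1
differs_after_update = not np.allclose(pano_layer(x), pers_layer(x))
```

Because the up-projection starts at zero, both branches initially reproduce the pretrained layer exactly, and they diverge only as their own adapters are trained.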