Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

TL;DR

This paper tackles the challenge of generating semantically coherent panoramas with diffusion models. It introduces the Merge-Attend-Diffuse (MAD) operator, which merges the latent features of multiple overlapping views so that self- and cross-attention operate on the aggregated latent space, promoting global coherence while remaining plug-and-play for pretrained backbones. Through quantitative and qualitative evaluations and a user study, the authors show that MAD increases semantic coherence while maintaining prompt adherence and visual quality, and they release the code. The work advances zero-shot long-image generation by letting views interact during denoising, which matters in practice for high-resolution panorama synthesis across different aspect ratios and backbones.

Abstract

Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which recent works have tackled by combining independent diffusion paths over overlapping latent features, an approach referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrates that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.
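
The split-and-merge bookkeeping behind joint diffusion can be illustrated with a minimal sketch. It is not taken from the released code: the function names (`view_starts`, `split_views`, `merge_views`), the list-of-rows latent layout, and the choice of plain averaging over overlapping columns are all assumptions made for illustration. It covers only how a wide latent is cut into overlapping views and recombined, not the attention step of MAD itself.

```python
# Illustrative sketch only: hypothetical helpers, not the authors' implementation.
# A "latent" here is a list of rows (height x width) of floats.

def view_starts(width, view_w, stride):
    """Left edges of overlapping windows that together cover [0, width)."""
    if view_w > width:
        raise ValueError("view is wider than the latent")
    starts = list(range(0, width - view_w + 1, stride))
    if starts[-1] != width - view_w:
        # Add a final window flush with the right edge so no column is missed.
        starts.append(width - view_w)
    return starts


def split_views(latent, view_w, stride):
    """Cut the wide latent into overlapping views along the width axis."""
    width = len(latent[0])
    starts = view_starts(width, view_w, stride)
    views = [[row[s:s + view_w] for row in latent] for s in starts]
    return views, starts


def merge_views(views, starts, width):
    """Recombine views; columns covered by several views are averaged (assumed rule)."""
    height = len(views[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for view, s in zip(views, starts):
        for y, row in enumerate(view):
            for dx, value in enumerate(row):
                total[y][s + dx] += value
                count[y][s + dx] += 1
    return [[total[y][x] / count[y][x] for x in range(width)]
            for y in range(height)]


# Usage: a 2 x 10 toy latent, views of width 4 taken every 3 columns.
latent = [[float(x + 10 * y) for x in range(10)] for y in range(2)]
views, starts = split_views(latent, view_w=4, stride=3)
merged = merge_views(views, starts, width=10)
```

In this toy setting, splitting and immediately merging returns the original latent, since every view agrees on the overlapping columns. In a joint diffusion loop each view would instead be denoised separately between the split and the merge, and the averaging is what reconciles the overlaps.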

Paper Structure

This paper contains 16 sections, 4 equations, 38 figures, and 9 tables.

Figures (38)

  • Figure 1: MultiDiffusion (Bar-Tal et al., 2023) blends the mountains with the clouds, lacking semantic coherence (top). Applying the proposed MAD leads to semantic coherence (bottom).
  • Figure 1: Long images generated by the considered LDM with MAD applied up to different numbers of inference steps for the prompt "A herd of Mustang horses crossing a river at sunset". When $\tau$ is too low, the interactions between views are not enough to produce a globally coherent image. As $\tau$ increases, the image becomes increasingly coherent, with maximal uniformity when MAD is applied at every timestep.
  • Figure 2: Overview of our inference-time pipeline (left) and its pseudo-code (right). During the diffusion process, the image is split into overlapping views, and each is fed to the model separately. Within the attention layers, MAD provides interaction points between the views, enforcing global coherence in the generated panorama.
  • Figure 3: Long images generated by the considered LDM with MAD, with $\tau = 15$, applied in different blocks of the noise prediction model for the prompt "A snowy winter landscape with frosted trees and a frozen lake". "None" is the setting where MAD is never applied.
  • Figure 4: Comparison of the distribution coverage of panoramas generated by MAD (red) and SD-L (blue) with respect to square images generated by SD (gray). Darker areas indicate an overlap between the embeddings of MAD and SD-L images.
  • ...and 33 more figures