Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas
Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara
TL;DR
This paper tackles the challenge of generating semantically coherent panoramas with diffusion models. It introduces the Merge-Attend-Diffuse (MAD) operator, which merges latent features from multiple overlapping views and injects them into attention layers to promote global coherence, while remaining plug-and-play for pretrained backbones. Through extensive quantitative and qualitative evaluations and a user study, the authors show that MAD improves semantic coherence while maintaining prompt adherence and visual quality, and they release the code. The work advances zero-shot long-image generation by letting views interact during denoising, with practical impact for high-resolution panorama synthesis across varying aspect ratios and backbones.
Abstract
Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which recent works have tackled by combining independent diffusion paths over overlapping latent features, an approach referred to as joint diffusion, to obtain perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrates that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.
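
To make the idea of splitting and merging diffusion paths over overlapping latent features concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that overlapping views are combined by simple per-column averaging, which the abstract does not specify, and it leaves out the attention step of the Merge-Attend-Diffuse operator entirely. The function names (`view_offsets`, `split`, `merge`) and the window and stride parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def view_offsets(width, view_width, stride):
    """Left edges of overlapping windows that cover columns [0, width)."""
    if view_width > width:
        raise ValueError("view_width must not exceed width")
    offsets = list(range(0, width - view_width + 1, stride))
    if offsets[-1] != width - view_width:
        # Add a final window so the right edge is always covered.
        offsets.append(width - view_width)
    return offsets


def split(latent, view_width, stride):
    """Cut a (C, H, W) latent into overlapping (C, H, view_width) views."""
    offsets = view_offsets(latent.shape[-1], view_width, stride)
    views = [latent[..., x:x + view_width].copy() for x in offsets]
    return views, offsets


def merge(views, offsets, width):
    """Average overlapping views back into one (C, H, width) latent.

    Assumption: plain averaging in the overlap regions; other weighting
    schemes are possible.
    """
    channels, height, view_width = views[0].shape
    total = np.zeros((channels, height, width), dtype=np.float64)
    count = np.zeros((1, 1, width), dtype=np.float64)
    for view, x in zip(views, offsets):
        total[..., x:x + view_width] += view
        count[..., x:x + view_width] += 1.0
    return total / count


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    panorama = rng.standard_normal((4, 8, 24))  # toy (C, H, W) latent
    views, offsets = split(panorama, view_width=8, stride=4)
    # In joint diffusion, each view would be denoised independently here.
    merged = merge(views, offsets, width=panorama.shape[-1])
    print(len(views), merged.shape)
```

In a joint diffusion loop, a split, per-view denoising, and merge of this kind would be repeated at each denoising step. According to the abstract, the proposed operator additionally has self- and cross-attention operate on the aggregated latent space, which this sketch does not model.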
