LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors

Yusuf Dalva, Yijun Li, Qing Liu, Nanxuan Zhao, Jianming Zhang, Zhe Lin, Pinar Yanardag

TL;DR

LayerFusion tackles the challenge of generating layered content by producing a foreground RGBA and a background RGB simultaneously, enabling harmonized interaction between layers. It introduces attention-based priors from the foreground generator and an attention-level blending scheme to jointly shape both layers and the final blended image. Key contributions include extracting structure and content priors from self- and cross-attention, a soft/hard blending mask mechanism, and an attention-sharing strategy to maintain background consistency. Experimental results show improved visual coherence, layer consistency, and spatial editability over prior layered-generation methods, underscoring its potential to enhance creative workflows.

Abstract

Large-scale diffusion models have achieved remarkable success in generating high-quality images from textual descriptions, gaining popularity across various applications. However, the generation of layered content, such as transparent images with foreground and background layers, remains an under-explored area. Layered content generation is crucial for creative workflows in fields like graphic design, animation, and digital art, where layer-based approaches are fundamental for flexible editing and composition. In this paper, we propose a novel image generation pipeline based on Latent Diffusion Models (LDMs) that generates images with two layers: a foreground layer (RGBA) with transparency information and a background layer (RGB). Unlike existing methods that generate these layers sequentially, our approach introduces a harmonized generation mechanism that enables dynamic interactions between the layers for more coherent outputs. We demonstrate the effectiveness of our method through extensive qualitative and quantitative experiments, showing significant improvements in visual coherence, image quality, and layer consistency compared to baseline methods.

Paper Structure

This paper contains 21 sections, 2 equations, 22 figures, 1 table, 3 algorithms.

Figures (22)

  • Figure 1: LayerFusion. We propose a framework for generating a foreground (RGBA), background (RGB) and blended (RGB) image simultaneously from an input text prompt. Our optimization-free blending approach targets the attention layers, providing an interaction mechanism between the image layers (i.e., foreground and background) that achieves harmonization during blending. Furthermore, because our framework builds on layered representations, it enables straightforward spatial editing of the generated image layers.
  • Figure 2: LayerFusion Framework. By making use of the generative priors extracted from the transparent generation model $\epsilon_{\theta, FG}$, LayerFusion is able to generate image triplets consisting of a foreground (RGBA), a background, and a blended image. Our framework involves three fundamental, interconnected components. First, we introduce a prior pass on $\epsilon_{\theta, FG}$ (a) for extracting the structure prior; we then introduce an attention-level interaction between the two denoising networks ($\epsilon_{\theta, FG}$ and $\epsilon_{\theta}$) (b), together with an attention-level blending scheme that combines a layer-wise content confidence prior with the structure prior (c).
  • Figure 3: Visualization of the masks extracted as generative priors. Throughout the generation process, we extract a structure prior $s$ and a content confidence prior $c$. To combine the structure and content information, we construct $mask_{soft}$ and $mask_{hard}$ during the blending process. As the provided prior maps show, we can capture the overall object structure with the structure prior $s$ and incorporate the content with $c$; their combination provides a precise mask reflecting both quantities (see the example "the car"). Note that the masks we construct also capture transparency information throughout the masking process (see the example "a glass bottle"). The masks shown are retrieved at the diffusion timestep $t = 0.8T$.
  • Figure 4: Qualitative Results. We present qualitative results on multi-layer generation over different visual concepts. In each column, we show the foreground layer, the background layer, and their generative blending, respectively; the results are high quality in terms of text-image alignment, transparency, and harmonization. We present more results in the supplementary material.
  • Figure 5: We perform extensive ablation studies on four aspects. (a) Background Influence on Foreground: Background changes (e.g., weather) dynamically adjust the foreground (e.g., outfit) while preserving identity. (b) Alpha vs. Generative Blending: Alpha Blending ensures a perfect match, while Generative Blending creates more realistic harmonization by handling shadows and lighting. (c) Self-Attention vs. Combined Attention Masks: Self-attention alone causes leaks; cross-attention alone affects the entire image. Combining both achieves sharper boundaries and better coherence. (d) Soft Decision Boundary Coefficient: Lower coefficients cause leaks; higher coefficients yield more precise alpha and consistent blending (e.g., the pocket of the man's clothing).
  • ...and 17 more figures
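
For readers unfamiliar with the alpha-blending baseline named in Figure 5(b), the following is a minimal sketch of standard straight (non-premultiplied) alpha compositing of an RGBA foreground over an RGB background. It is not the paper's generative blending, and it is not taken from the authors' code: the function names `alpha_blend_pixel` and `alpha_blend`, the nested-list image layout, and the assumption that all channel values lie in [0, 1] are illustrative choices.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]


def alpha_blend_pixel(fg: RGBA, bg: RGB) -> RGB:
    """Composite one straight-alpha RGBA pixel over an opaque RGB pixel.

    Assumes every channel, including alpha, is a float in [0, 1].
    """
    r, g, b, a = fg
    return (
        a * r + (1.0 - a) * bg[0],
        a * g + (1.0 - a) * bg[1],
        a * b + (1.0 - a) * bg[2],
    )


def alpha_blend(fg_layer: List[List[RGBA]], bg_layer: List[List[RGB]]) -> List[List[RGB]]:
    """Composite a foreground layer over a background layer of the same size.

    Both layers are row-major nested lists (a hypothetical layout chosen
    to keep the sketch dependency-free).
    """
    if len(fg_layer) != len(bg_layer) or any(
        len(f_row) != len(b_row) for f_row, b_row in zip(fg_layer, bg_layer)
    ):
        raise ValueError("foreground and background must have the same shape")
    return [
        [alpha_blend_pixel(f, b) for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(fg_layer, bg_layer)
    ]
```

Because each output pixel depends only on the two input pixels at the same location, this kind of compositing reproduces the foreground exactly where alpha is 1 but cannot add cross-layer effects such as shadows or lighting changes, which is the contrast the caption draws with generative blending.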