LayeringDiff: Layered Image Synthesis via Generation, then Disassembly with Generative Knowledge

Kyoungkook Kang; Gyujin Sim; Geonung Kim; Donguk Kim; Seungho Nam; Sunghyun Cho

TL;DR

LayeringDiff reframes layered image synthesis as a decomposition problem: it first generates a composite image with a pretrained generator, then recovers the foreground and background layers using a diffusion-based Foreground-Background Diffusion Decomposition (FBDD) module and a High-Frequency Alignment (HFA) module. Because layers are extracted rather than generated from scratch, the approach avoids large-scale training for layer-specific generation and relies on pretrained generative priors to produce diverse, well-proportioned layers with refined textures that composite seamlessly. Experiments, including a user study, show improved foreground and background quality and more natural blending, and demonstrate applicability to multi-layer synthesis and real-world image decomposition. Limitations in alpha accuracy and shadow handling are discussed further in the supplementary material.

Abstract

Layers have become indispensable tools for professional artists, allowing them to build a hierarchical structure that enables independent control over individual visual elements. In this paper, we propose LayeringDiff, a novel pipeline for the synthesis of layered images, which begins by generating a composite image using an off-the-shelf image generative model, followed by disassembling the image into its constituent foreground and background layers. By extracting layers from a composite image, rather than generating them from scratch, LayeringDiff bypasses the need for large-scale training to develop generative capabilities for individual layers. Furthermore, by utilizing a pretrained off-the-shelf generative model, our method can produce diverse content and object scales in the synthesized layers. For effective layer decomposition, we adapt a large-scale pretrained generative prior to estimate foreground and background layers. We also propose high-frequency alignment modules to refine the fine details of the estimated layers. Our comprehensive experiments demonstrate that our approach effectively synthesizes layered images and supports various practical applications.
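
To make "disassembling the image into its constituent foreground and background layers" concrete, the sketch below shows the standard alpha-compositing relation that any such decomposition is normally expected to satisfy: each composite pixel equals alpha times the foreground plus (1 - alpha) times the background. This is a minimal illustration on single-channel pixel lists and not the paper's implementation; the function names `composite` and `reconstruction_error` are hypothetical, and the assumption that LayeringDiff uses exactly this blending rule is ours.

```python
from typing import List


def composite(fg: List[float], bg: List[float], alpha: List[float]) -> List[float]:
    """Blend foreground over background per pixel: c = a * f + (1 - a) * b.

    All inputs are flat lists of single-channel values in [0, 1].
    (Hypothetical helper; standard alpha compositing, assumed here.)
    """
    if not (len(fg) == len(bg) == len(alpha)):
        raise ValueError("fg, bg and alpha must have the same length")
    return [a * f + (1.0 - a) * b for f, b, a in zip(fg, bg, alpha)]


def reconstruction_error(
    target: List[float], fg: List[float], bg: List[float], alpha: List[float]
) -> float:
    """Largest per-pixel gap between a target composite and the recomposed layers."""
    recomposed = composite(fg, bg, alpha)
    return max(abs(t - r) for t, r in zip(target, recomposed))


if __name__ == "__main__":
    fg = [1.0, 1.0, 0.5]
    bg = [0.0, 0.0, 0.5]
    alpha = [1.0, 0.5, 0.0]  # opaque, semi-transparent, fully background
    image = composite(fg, bg, alpha)
    print(image)
    print(reconstruction_error(image, fg, bg, alpha))
```

A decomposition whose layers recompose to the original composite with small error is consistent with the input image; a large error indicates that the estimated layers or the alpha mask have drifted from it.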

Paper Structure (25 sections, 3 equations, 8 figures, 4 tables)

Introduction
Related Work
Text-based Layered Image Synthesis
Image Matting and Inpainting
LayeringDiff
Initial Image Generation Stage
Foreground Determination Stage
Layering Stage
FBDD module
HFA module
Training of LayeringDiff
Dataset construction
Training of the HFA module
Experiments
Comparative Evaluation
...and 10 more sections

Figures (8)

Figure 1: Overview of LayeringDiff. From an input text prompt $T$ that includes a foreground prompt $T_F$ (red words), the initial image generation stage synthesizes an initial composite image $C_i$. The foreground determination stage then identifies a foreground region based on the foreground prompt $T_F$ and produces an alpha mask $\alpha$. Lastly, the layering stage separates $C_i$ into a foreground layer $F$ and a background layer $B$.
Figure 2: Layers decomposed by the FBDD module may suffer from degraded texture quality (c). The HFA module enhances high-frequency details in these layers (d) using those from the initial composite image (a). Note that in (a) the text in the background is covered by the semi-transparent plane in the foreground layer.
Figure 3: Qualitative comparison of backgrounds $B$ produced by BANs trained with different loss functions. The inset in the top-left image shows the input composite image.
Figure 4: Qualitative comparison of layered images generated by the three LayerDiffuse models and by our method. The input prompt is shown at the top of each example; red words denote the foreground prompt and blue words denote the background prompt. LayerDiffuse models tend to produce foreground objects that are disproportionately large relative to the background, whereas our method generates realistic, well-proportioned layered images.
Figure 5: Qualitative comparison of layered images generated by Text2Layer, LayerDiff, and our method for the input prompt shown at the top of the figure.
...and 3 more figures
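
The idea in the Figure 2 caption, restoring high-frequency detail in a decomposed layer using detail from the initial composite, can be illustrated with a classical frequency split. The sketch below is only a hand-written stand-in under our own assumptions and is not the paper's HFA module, whose internals are not described in this summary: it uses a 1D signal, a simple box blur as the low-pass filter, and the hypothetical helper names `box_blur` and `transfer_high_frequency`.

```python
from typing import List


def box_blur(signal: List[float], radius: int) -> List[float]:
    """Low-pass filter: mean over a window of +/- radius, truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def transfer_high_frequency(
    layer: List[float], composite: List[float], radius: int = 1
) -> List[float]:
    """Keep the layer's low frequencies and add the composite's high frequencies.

    high(composite) = composite - blur(composite)
    result          = blur(layer) + high(composite)
    (Illustrative stand-in only; assumes the composite detail belongs to this layer.)
    """
    if len(layer) != len(composite):
        raise ValueError("layer and composite must have the same length")
    layer_low = box_blur(layer, radius)
    comp_low = box_blur(composite, radius)
    return [l + (c - cl) for l, c, cl in zip(layer_low, composite, comp_low)]


if __name__ == "__main__":
    sharp = [0.0, 1.0, 0.0, 1.0, 0.0]   # composite with fine texture
    blurry = box_blur(sharp, 1)         # layer that lost its texture
    print(transfer_high_frequency(blurry, sharp, radius=1))
```

This sketch ignores the alpha mask: in a real image, composite detail should only be transferred to the layer that is actually visible at each pixel, which is one reason a fixed filter like this would not be sufficient in practice.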
