Text2Layer: Layered Image Generation using Latent Diffusion Model

Xinyang Zhang; Wentian Zhao; Xin Lu; Jeff Chien

TL;DR

This work presents Text2Layer, a text-guided, diffusion-based approach to layered image generation. It jointly models a foreground $F$, a background $B$, and a layer mask $m$ along with the composed image $C = mF + (1-m)B$. It introduces the Composition-Aware Two-Layer Autoencoder (CaT2I-AE) and trains diffusion models in its latent space to produce coherent layered outputs, which enables explicit layer-level editing. The large-scale LAION-L2I dataset (57.02M samples after filtering) is built from LAION-Aesthetics using saliency-based segmentation and inpainting, with quality filters to keep the training data reliable. Empirically, CaT2I-AE-SD achieves better image fidelity (lower FID), higher text-image relevance (CLIP score), and more accurate masks (higher IoU) than the baselines, establishing a benchmark for layered image generation and enabling practical layer-based editing workflows.
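
As a minimal sketch of the two quantities mentioned above, the snippet below applies the composition rule $C = mF + (1-m)B$ per pixel and computes an intersection-over-union score between two masks. It is illustrative only and is not the authors' code: the function names `compose` and `mask_iou`, the nested-list image representation, and the 0.5 binarization threshold are all assumptions made for this example.

```python
def compose(fg, bg, mask):
    """Per-pixel blend C = m*F + (1 - m)*B.

    fg, bg: H x W grids of (r, g, b) tuples with values in [0, 1].
    mask:   H x W grid of floats in [0, 1]; 1 selects the foreground.
    """
    return [
        [
            tuple(m * f + (1.0 - m) * b for f, b in zip(f_px, b_px))
            for f_px, b_px, m in zip(f_row, b_row, m_row)
        ]
        for f_row, b_row, m_row in zip(fg, bg, mask)
    ]


def mask_iou(pred, target, threshold=0.5):
    """IoU of two masks after binarizing at `threshold` (an assumed value)."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p_on, t_on = p >= threshold, t >= threshold
            inter += p_on and t_on   # both masks mark the pixel
            union += p_on or t_on    # at least one mask marks the pixel
    # Two empty masks are treated as a perfect match by convention.
    return inter / union if union else 1.0


# A 1 x 2 image: red foreground, blue background, mask keeps only the left pixel.
fg = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
bg = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
mask = [[1.0, 0.0]]
composed = compose(fg, bg, mask)
```

With a hard mask the composed image simply takes foreground pixels where $m = 1$ and background pixels where $m = 0$; intermediate mask values blend the two layers linearly.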

Abstract

Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of generating an image, we propose to generate background, foreground, layer mask, and the composed image simultaneously. To achieve layered image generation, we train an autoencoder that is able to reconstruct layered images and train diffusion models on the latent representation. One benefit of the proposed problem is to enable better compositing workflows in addition to the high-quality image output. Another benefit is producing higher-quality layer masks compared to masks produced by a separate step of image segmentation. Experimental results show that the proposed method is able to generate high-quality layered images and initiates a benchmark for future work.

Paper Structure (25 sections, 7 equations, 18 figures, 5 tables)

Introduction
Related Work
Synthesizing High-Quality Layered Images
Definition of Two-Layer Image
Overview of LAION-L2I Dataset Construction
Extracting foreground and background parts
Quality filtering
Modeling
Text2Layer Formulation
CaT2I-AE Architecture
Training Objective
Experiments
Baseline Methods
Metrics
Implementation Details
...and 10 more sections

Figures (18)

Figure 1: Examples of two-layer images. Prompts are displayed on the top of images. Each example includes foreground (fg), background (bg), and mask component to compose a two-layer image. From left to right of each example: fg, bg, mask, and composed image.
Figure 2: Layer mask examples. The scale, location, and number of objects vary largely.
Figure 3: Failure cases of salient object segmentation (left) and inpainting (middle and right). The red shaded regions indicate the object to be removed.
Figure 4: Predicted good and bad salient masks and inpaintings
Figure 5: Composited samples from CaT2I-AE-SD and baseline models for $256 \times 256$ resolution. The prompts are (1) Dogs at the entrance of Arco Iris Boutique and (2) Haunted Mansion Holiday at Disneyland Park. Each $2 \times 2$ block displays the composited images and masks $m$ of the corresponding models. Find more in the supplement.
...and 13 more figures
