Text2Layer: Layered Image Generation using Latent Diffusion Model

Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien

TL;DR

This work presents Text2Layer, a text-guided, diffusion-based approach that generates layered images by modeling a foreground $F$, a background $B$, and a layer mask $m$ together with the composed image $C = mF + (1-m)B$. It introduces the Composition-Aware Two-Layer Autoencoder (CaT2I-AE) and trains diffusion models in its latent space to produce coherent layered outputs that can be edited layer by layer. A large-scale dataset, LAION-L2I (57.02M filtered samples), is built from LAION-Aesthetics using saliency-based segmentation and inpainting, with quality filters to keep the training data reliable. In the reported experiments, CaT2I-AE-SD achieves better image fidelity (lower FID), higher text-image relevance (CLIP score), and more accurate masks (higher IoU) than the baselines, establishing a benchmark for layered image generation.
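The composition rule $C = mF + (1-m)B$ and the mask IoU metric mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `compose` and `mask_iou`, the toy arrays, and the 0.5 binarization threshold are assumptions made for illustration only.

```python
import numpy as np


def compose(fg, bg, mask):
    """Blend two layers per pixel: C = m * F + (1 - m) * B.

    fg, bg: float arrays of shape (H, W, 3) with values in [0, 1].
    mask:   float array of shape (H, W) or (H, W, 1) with values in [0, 1].
    """
    fg = np.asarray(fg, dtype=np.float64)
    bg = np.asarray(bg, dtype=np.float64)
    m = np.asarray(mask, dtype=np.float64)
    if m.ndim == fg.ndim - 1:
        m = m[..., None]  # add a channel axis so the mask broadcasts over RGB
    return m * fg + (1.0 - m) * bg


def mask_iou(pred, target, threshold=0.5):
    """Intersection over union of two masks after binarization.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    p = np.asarray(pred) >= threshold
    t = np.asarray(target) >= threshold
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both masks empty: treat them as identical
    return float(np.logical_and(p, t).sum() / union)


# Toy example: white foreground, black background, soft 2x2 mask.
fg = np.ones((2, 2, 3))
bg = np.zeros((2, 2, 3))
mask = np.array([[1.0, 0.5], [0.0, 0.0]])
composite = compose(fg, bg, mask)
```

With these toy inputs, each pixel of `composite` simply takes the mask value in all three channels, which makes the role of $m$ as a per-pixel blending weight easy to see.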

Abstract

Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of generating an image, we propose to generate background, foreground, layer mask, and the composed image simultaneously. To achieve layered image generation, we train an autoencoder that is able to reconstruct layered images and train diffusion models on the latent representation. One benefit of the proposed problem is to enable better compositing workflows in addition to the high-quality image output. Another benefit is producing higher-quality layer masks compared to masks produced by a separate step of image segmentation. Experimental results show that the proposed method is able to generate high-quality layered images and initiates a benchmark for future work.

Paper Structure

This paper contains 25 sections, 7 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Examples of two-layer images. Prompts are displayed above the images. Each example includes the foreground (fg), background (bg), and mask components that compose a two-layer image. From left to right in each example: fg, bg, mask, and composed image.
  • Figure 2: Layer mask examples. The scale, location, and number of objects vary widely.
  • Figure 3: Failure cases of salient object segmentation (left) and inpainting (middle and right). The red shaded regions indicate the object to be removed.
  • Figure 4: Predicted good and bad salient masks and inpaintings.
  • Figure 5: Composited samples from CaT2I-AE-SD and baseline models at $256 \times 256$ resolution. The prompts are (1) Dogs at the entrance of Arco Iris Boutique and (2) Haunted Mansion Holiday at Disneyland Park. Each $2 \times 2$ block displays the composited images and masks $m$ of the corresponding model. More samples are in the supplement.
  • ...and 13 more figures