Text2Layer: Layered Image Generation using Latent Diffusion Model
Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien
TL;DR
This work presents Text2Layer, a text-guided, diffusion-based approach to layered image generation. It models a foreground $F$, a background $B$, and a layer mask $m$ together with the composed image $C = mF + (1-m)B$. It introduces the Composition-Aware Two-Layer Autoencoder (CaT2I-AE) and trains diffusion models in its latent space to produce coherent layered outputs, which enables explicit layer-level editing. A large-scale dataset, LAION-L2I (57.02M filtered samples), is built from LAION-Aesthetics using saliency-based segmentation and inpainting, with quality filters to ensure reliable training data. Empirical results show that CaT2I-AE-SD achieves better image fidelity (lower FID), higher text-image relevance (CLIP score), and more accurate masks (higher IoU) than the baselines, establishing a benchmark for layered-image diffusion and enabling practical layer-based editing workflows.
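The compositing relation $C = mF + (1-m)B$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `compose`, the array shapes, and the toy values are assumptions made for illustration, with images taken as float arrays in $[0, 1]$ and the mask as a per-pixel weight.

```python
import numpy as np

def compose(foreground, background, mask):
    """Blend two layers per pixel: C = m * F + (1 - m) * B.

    Hypothetical helper for illustration only. Layers are (H, W, 3)
    arrays; the mask is (H, W) or (H, W, 1) with values in [0, 1].
    """
    f = np.asarray(foreground, dtype=np.float64)
    b = np.asarray(background, dtype=np.float64)
    m = np.asarray(mask, dtype=np.float64)
    if m.ndim == f.ndim - 1:
        m = m[..., None]  # broadcast an (H, W) mask over the channel axis
    return m * f + (1.0 - m) * b

# Toy 2x2 example: an all-white foreground over an all-black background.
F = np.ones((2, 2, 3))
B = np.zeros((2, 2, 3))
m = np.array([[1.0, 0.0],
              [0.5, 0.25]])
C = compose(F, B, m)

# Layer-level edit: swap in a new background and recompose,
# leaving the foreground and the mask untouched.
B_new = np.full((2, 2, 3), 0.5)
C_edit = compose(F, B_new, m)
```

Because the layers are kept separate, an edit such as replacing the background only requires recomputing the blend, which is the kind of workflow that layered generation is meant to support.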
Abstract
Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of generating a single image, we propose to generate the background, foreground, layer mask, and composed image simultaneously. To achieve layered image generation, we train an autoencoder that can reconstruct layered images and then train diffusion models on its latent representation. One benefit of this formulation is that it enables better compositing workflows in addition to high-quality image output. Another is that it produces higher-quality layer masks than those obtained from a separate image segmentation step. Experimental results show that the proposed method generates high-quality layered images and provides an initial benchmark for future work.
