Transparent Image Layer Diffusion using Latent Transparency

Lvmin Zhang, Maneesh Agrawala

TL;DR

The paper tackles the lack of scalable, high-quality transparent image generation by introducing latent transparency, a latent-offset mechanism that encodes alpha channels into a pretrained latent diffusion model without disrupting its latent distribution. It jointly trains a latent transparency encoder/decoder and leverages shared attention plus LoRAs to support multi-layer generation, enabling foreground/background conditioning and harmonious layering. A large human-in-the-loop dataset (1M transparent image pairs, plus 1M multi-layer pairs) underpins training, and extensive experiments show native transparent outputs outperform ad-hoc generate-then-matting with 97% user preference, while achieving quality comparable to commercial assets like Adobe Stock. The work demonstrates practical impact for content creation workflows by enabling layer-based, editable transparency directly from diffusion models and offering broad compatibility with community models and control signals.

Abstract

We present LayerDiffuse, an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a "latent transparency" that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock.
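The "latent offset" idea in the abstract can be illustrated with a toy numerical sketch. This is not the authors' implementation: the latent shape, the 0.05 scale, and the helper names `add_latent_offset` and `distribution_shift` are all assumptions made for illustration, and random noise stands in for both the pretrained latent and the output of the learned transparency encoder. The sketch only shows that adding a small offset leaves the summary statistics of the latent nearly unchanged, which is the property the abstract describes.

```python
import numpy as np


def add_latent_offset(latent, offset):
    """Return the adjusted latent: the original latent plus an additive offset."""
    return latent + offset


def distribution_shift(latent, adjusted):
    """Absolute change in mean and standard deviation between two latents."""
    mean_shift = abs(adjusted.mean() - latent.mean())
    std_shift = abs(adjusted.std() - latent.std())
    return mean_shift, std_shift


rng = np.random.default_rng(0)

# Stand-in for a latent image from a pretrained model (hypothetical 4x64x64 shape).
latent = rng.standard_normal((4, 64, 64))

# Stand-in for the transparency offset; in the paper this comes from a learned
# encoder, here it is just small random noise (assumption for illustration).
offset = 0.05 * rng.standard_normal((4, 64, 64))

adjusted = add_latent_offset(latent, offset)
mean_shift, std_shift = distribution_shift(latent, adjusted)
print(f"mean shift: {mean_shift:.5f}, std shift: {std_shift:.5f}")
```

In this toy setting both shifts stay far below the unit scale of the latent, so a model trained on the original latent space would see inputs that look statistically familiar.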

Paper Structure (38 sections, 11 equations, 38 figures, 3 tables)

Figures (38)

  • Figure 1: Latent Transparency. Given an input transparent image, our framework encodes a "latent transparency" to adjust the latent space of Stable Diffusion. The adjusted latent images can be decoded to reconstruct the color and alpha. This latent space with transparency can be further used in training or fine-tuning pretrained image diffusion models.
  • Figure 2: Model Training. We visualize the training of the base model to generate transparent images, and the training of the multi-layer model to generate multiple layers together. When training the base diffusion model (a), all model weights are trainable, whereas for training the multi-layer model (b), only two LoRAs are trainable (the foreground LoRA and background LoRA).
  • Figure 3: Dataset Preparation. We demonstrate the preparation of the two datasets: the transparent image dataset (base dataset) and the multi-layer dataset. The base dataset is collected by downloading transparent images online and applying a human-in-the-loop training method. The multi-layer dataset is synthesized with our transparent diffusion model and several state-of-the-art models including ChatGPT, the SDXL inpaint model, etc. The final scale of each dataset is around 1M.
  • Figure 4: Human-in-the-loop data screening. We visualize examples that are preserved versus removed in each round of the dataset collection process. We show examples from rounds 1, 5, 10, and 20. The prompts are randomly sampled during the collection process.
  • Figure 5: Qualitative Results. We showcase various examples of transparent images generated by our model. The prompt for each group is given at the top of the examples. These examples only use our base single-layer model.
  • ...and 33 more figures
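
The captions above refer to color and alpha channels and to separate foreground and background layers. As a reminder of how such layers combine into a final image, here is a minimal sketch of the standard "over" alpha-compositing operator. It is general image-compositing arithmetic, not code from the paper; the function name `composite_over` and the tiny 2x2 example are assumptions chosen for illustration.

```python
import numpy as np


def composite_over(fg_rgb, fg_alpha, bg_rgb):
    """Blend a transparent foreground layer over an opaque background.

    fg_rgb, bg_rgb: float arrays of shape (H, W, 3) with values in [0, 1].
    fg_alpha: float array of shape (H, W) with values in [0, 1].
    """
    alpha = fg_alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return alpha * fg_rgb + (1.0 - alpha) * bg_rgb


# Hypothetical 2x2 example: a white foreground over a black background.
fg_rgb = np.ones((2, 2, 3))
bg_rgb = np.zeros((2, 2, 3))
fg_alpha = np.array([[1.0, 0.0],
                     [0.5, 0.25]])

blended = composite_over(fg_rgb, fg_alpha, bg_rgb)
print(blended[..., 0])  # red channel of the blended image
```

With a white foreground and a black background, each blended pixel simply equals its alpha value, which makes the effect of partial transparency easy to inspect.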