Transparent Image Layer Diffusion using Latent Transparency
Lvmin Zhang, Maneesh Agrawala
TL;DR
The paper addresses the lack of scalable, high-quality transparent image generation by introducing latent transparency, a latent-offset mechanism that encodes alpha channels into a pretrained latent diffusion model without disrupting its latent distribution. It jointly trains a latent transparency encoder and decoder, and uses shared attention plus LoRAs to support multi-layer generation, enabling foreground/background conditioning and harmonious layering. A large dataset collected with a human in the loop (1M transparent image pairs, plus 1M multi-layer pairs) underpins training. In a user study, participants prefer the natively generated transparent outputs over ad-hoc generate-then-matte pipelines in 97% of cases and rate their quality as comparable to commercial assets such as Adobe Stock. The work is practically useful for content creation workflows: it produces layer-based, editable transparency directly from diffusion models and is compatible with community models and control signals.
Abstract
We present LayerDiffuse, an approach enabling large-scale pretrained latent diffusion models to generate transparent images. The method allows generation of single transparent images or of multiple transparent layers. The method learns a "latent transparency" that encodes alpha channel transparency into the latent manifold of a pretrained latent diffusion model. It preserves the production-ready quality of the large diffusion model by regulating the added transparency as a latent offset with minimal changes to the original latent distribution of the pretrained model. In this way, any latent diffusion model can be converted into a transparent image generator by finetuning it with the adjusted latent space. We train the model with 1M transparent image layer pairs collected using a human-in-the-loop collection scheme. We show that latent transparency can be applied to different open-source image generators, or be adapted to various conditional control systems to achieve applications such as foreground/background-conditioned layer generation, joint layer generation, and structural control of layer contents. A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report that the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock.
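To make the "latent offset" idea concrete, the following is a minimal toy sketch, not the authors' implementation. It treats a latent as a flat list of floats, adds a small offset derived from alpha values, and measures how far the adjusted latent moves from the original. The function names (`toy_offset_from_alpha`, `add_latent_offset`, `mean_abs_shift`), the linear mapping from alpha to offset, and the `scale` parameter are all assumptions made for illustration; in the paper the offset comes from a trained encoder.

```python
from typing import List

Vector = List[float]


def toy_offset_from_alpha(alpha: Vector, scale: float = 0.01) -> Vector:
    """Hypothetical stand-in for a learned transparency encoder.

    Maps each alpha value in [0, 1] to a small offset in [-scale, scale].
    A real encoder would be a trained network; this is only a placeholder.
    """
    return [scale * (2.0 * a - 1.0) for a in alpha]


def add_latent_offset(latent: Vector, offset: Vector) -> Vector:
    """Return the adjusted latent: the original latent plus the offset."""
    if len(latent) != len(offset):
        raise ValueError("latent and offset must have the same length")
    return [z + d for z, d in zip(latent, offset)]


def mean_abs_shift(latent: Vector, adjusted: Vector) -> float:
    """Average absolute change per element.

    A crude proxy (assumed here) for how much the adjustment perturbs
    the original latent distribution; smaller means less disruption.
    """
    return sum(abs(a - z) for z, a in zip(latent, adjusted)) / len(latent)


# Toy values, not real model latents.
latent = [0.5, -1.2, 0.3, 2.0]
alpha = [1.0, 0.0, 0.5, 1.0]

offset = toy_offset_from_alpha(alpha)
adjusted = add_latent_offset(latent, offset)
shift = mean_abs_shift(latent, adjusted)
```

Because every offset element is bounded by `scale`, the adjusted latent stays close to the original, which mirrors the abstract's stated goal of adding transparency "with minimal changes to the original latent distribution". How the actual method enforces this during training is not specified in the text above.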
