LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, Hang Xu

TL;DR

LayerDiff addresses a limitation of conventional text-to-image diffusion models, which generate a single flat image, by enabling text-guided, multi-layered composable image synthesis with a layer-collaborative diffusion framework. It introduces inter-layer attention, a layer-specific prompt enhancer, and self-mask guidance to jointly generate layer images and masks in a latent space, conditioned on a global prompt and per-layer prompts. A data construction pipeline yields the MLCID dataset used to train LayerDiff. Experiments show quality competitive with whole-image diffusion in the two-layer case, while three- and four-layer results leave room for improvement as training data grows. The approach supports layer-wise editing, inpainting, and style transfer without extra fine-tuning. The main limitation is the scale of data for higher layer counts, and future work will focus on scalable multi-layer data generation.

Abstract

Despite the success of diffusion-based generative models in generating high-quality images from arbitrary text prompts, prior works directly generate entire images and cannot provide object-wise manipulation capability. In real-world applications such as professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore, in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and an associated mask layer for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.

Paper Structure (29 sections, 6 equations, 17 figures, 3 tables)

This paper contains 29 sections, 6 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: An example of a multi-layered composable image. The multi-layered composable image includes a background layer, a set of foreground layers, and the corresponding layer masks. Layer images, layer masks, and layer prompts shown in the same color belong to the same layer. Text-guided multi-layered composable image synthesis aims to generate the layer images and layer masks simultaneously, guided by a global prompt that controls the holistic content and by layer prompts that control each layer's content. A whole image can then be composited by assembling these layers according to the masks.
  • Figure 2: Overall architecture of the proposed LayerDiff. LayerDiff performs multi-layered composable image synthesis by generating the layer images and layer masks simultaneously under the guidance of both the global prompt and the layer prompts. The layer-specific prompt enhancer ensures that the layer text conditions guide content generation in each layer. In the layer-collaborative diffusion model, the layer-collaborative attention block learns cross-layer relationships and injects the text guidance signal into the model.
  • Figure 3: Detailed structure of the proposed layer-collaborative attention block. (a) The UNet attention block is commonly used in the traditional UNet Block for condition-based whole-image generation. (b) To better capture layer-wise features, we introduce the Layer-Collaborative Attention Block, which incorporates a Text-guided Intra-Layer Attention Module to guide layer content generation and an Inter-Layer Attention Module to enable cross-layer interaction. (c) The Layer-Specific Prompt Enhancer is designed to more effectively allow layer-specific prompts to assimilate information from the global prompt.
  • Figure 4: Pipeline of the Multi-Layered Composable Image Construction. We use InstructBLIP [dai2023instructblip] for image captioning. These prompts guide open-set segmentation via DetCLIP+SAM to produce image layers and masks, and the background image is refined with the Stable Diffusion inpainting model [rombach2022high].
  • Figure 5: Qualitative results of the synthesized two-layered images. We compare our composited images with samples generated by Stable Diffusion [rombach2022high] using the whole-image generation approach, and with Text2Layer [zhang2023text2layer], which also performs multi-layered image generation. The quality of our multi-layer generation is found to be comparable to the samples produced by Stable Diffusion.
  • ...and 12 more figures
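
The Figure 1 caption notes that a whole image can be composited by assembling the layers according to their masks. The sketch below illustrates one plausible way to do that, assuming a standard "over" blend applied from the bottom layer to the top. The function name `composite`, the nested-list image representation, and the bottom-to-top ordering are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]   # H x W grid of RGB pixels with values in [0, 1]
Mask = List[List[float]]    # H x W grid of opacities in [0, 1]


def composite(background: Image, foregrounds: Sequence[Tuple[Image, Mask]]) -> Image:
    """Paste foreground layers onto the background, bottom layer first.

    Hypothetical helper: the paper's exact compositing rule is not given here,
    so this uses the common blend out = m * layer + (1 - m) * out.
    """
    out = [list(row) for row in background]  # copy rows so the input is not mutated
    for layer, mask in foregrounds:
        for y, row in enumerate(out):
            for x, below in enumerate(row):
                m = mask[y][x]
                above = layer[y][x]
                # m = 1 shows the foreground layer, m = 0 keeps what is underneath.
                row[x] = tuple(m * a + (1.0 - m) * b for a, b in zip(above, below))
    return out


# Toy 1 x 2 image: a blue background and a red layer whose mask covers only the left pixel.
background = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
red_layer = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
red_mask = [[1.0, 0.0]]

result = composite(background, [(red_layer, red_mask)])
print(result)  # left pixel is red, right pixel stays blue
```

Because each layer is kept separately until this final step, editing or restyling one layer only requires swapping its image or mask and compositing again.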