DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model

Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu, Zhao Wang, Yitong Wang, Xinglong Wu, Guanbin Li

TL;DR

DreamLayer targets coherent multi-layer generation with diffusion models by modeling the relationships among a background layer $I^1$, multiple foreground layers $I^i$, and a global layer $I^{k+1}$. It introduces three components: Context-Aware Cross-Attention to align layouts, Layer-Shared Self-Attention to share information across layers, and Information Retained Harmonization to fuse layers in latent space. The work is backed by a 400k-sample multi-layer dataset and a pipeline that supports both generation and decomposition. According to the authors' experiments and user studies, the approach yields more harmonious multi-layer compositions, improves occlusion and shadow realism, and enables training-free image-to-layer decomposition and latent-space editing. It is aimed at design and editing tasks that need editable layers with consistent inter-layer relationships.

Abstract

Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.

Paper Structure

This paper contains 24 sections, 16 equations, 22 figures, 6 tables.

Figures (22)

  • Figure 1: DreamLayer can handle multiple tasks: (a) Text-to-layer: Given a text input, we use GPT-4 to decompose foreground and background elements, feeding them into DreamLayer to generate a multi-layered image. (b) Image-to-layer: By using inversion to initialize the starting latent, DreamLayer can decompose an image based on text prompts. (c) Latent-space editing: During denoising, DreamLayer can respond to editing instructions, producing more harmonious and consistent edited images.
  • Figure 2: Multi-layer Dataset: Each image consists of a multi-layer structure, including a background and multiple foreground objects, with each foreground object represented as a transparent layer.
  • Figure 3: The DreamLayer Framework for Multi-Layer Image Generation: During the generation process, background and foreground prompts are combined via layer assign embeddings to form a global prompt $C_t^{k+1}$. In the attention phase, CACA extracts a context map from the global layer. Subsequently, the contextual information is fused across layers through LSSA, based on the global context map. Finally, IRH fuses the images using the latent image during the denoising process, achieving a harmonious result.
  • Figure 4: Overview of the Attention Mechanism in DreamLayer: (a) Context-Aware Cross-Attention for extracting the global context map and guiding the foreground layer layout; (b) Layer-Shared Self-Attention for establishing inter-layer connections and ensuring consistency.
  • Figure 5: The pipeline of multi-layer data preparation. We utilize GPT-4 to process a randomly selected base prompt, structuring it into a background prompt and multiple foreground prompts. After generating the image using a diffusion model, we apply the open-set detection model GroundingDINO to identify the positions of the foreground objects and use the DepthAnything model to obtain a depth map. Based on the depth order, we sequentially extract the foreground layers and fill in the missing areas with an inpainting model.
  • ...and 17 more figures
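
The Figure 5 caption orders foreground objects by depth before extracting them, and the Figure 2 caption describes each sample as a background plus transparent foreground layers. The Python sketch below illustrates those two ideas on plain nested lists. It is not the authors' code: the function names `order_front_to_back` and `composite`, the per-box median depth, the `larger_is_nearer` flag, and the standard "over" alpha blend are all assumptions made for illustration. Detection, depth estimation, and inpainting are left out entirely.

```python
from statistics import median


def order_front_to_back(boxes, depth_map, larger_is_nearer=True):
    """Return box indices sorted from nearest to farthest object.

    boxes:      list of (x0, y0, x1, y1) pixel boxes, upper bounds exclusive
                (hypothetical stand-in for detector output).
    depth_map:  2D list of floats indexed as depth_map[y][x].
    larger_is_nearer: assumption about the depth convention; flip it if
                larger values mean farther away in your depth model.
    """
    def box_depth(box):
        x0, y0, x1, y1 = box
        # Median depth inside the box; raises StatisticsError for an empty box.
        return median(depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1))

    depths = [box_depth(b) for b in boxes]
    return sorted(range(len(boxes)), key=lambda i: depths[i], reverse=larger_is_nearer)


def composite(background, layers):
    """Blend RGBA foreground layers over an RGB background, back to front.

    background: 2D list of (r, g, b) tuples.
    layers:     list of 2D lists of (r, g, b, a) tuples with a in [0, 1],
                ordered from the farthest layer to the nearest one.
    Returns a new 2D list of (r, g, b) tuples; inputs are not modified.
    """
    out = [[tuple(px) for px in row] for row in background]
    for layer in layers:
        for y, row in enumerate(layer):
            for x, (r, g, b, a) in enumerate(row):
                br, bg_, bb = out[y][x]
                # Standard "over" blend of one layer pixel onto the current result.
                out[y][x] = (
                    a * r + (1 - a) * br,
                    a * g + (1 - a) * bg_,
                    a * b + (1 - a) * bb,
                )
    return out
```

Under these assumptions, the index list from `order_front_to_back` gives one possible extraction order, and reversing it gives the back-to-front order that `composite` expects when reassembling the full image from its layers.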