Controllable Layer Decomposition for Reversible Multi-Layer Image Generation

Zihao Liu, Zunnan Xu, Shi Shu, Jun Zhou, Ruicheng Zhang, Zhenchao Tang, Xiu Li

TL;DR

This work tackles the irreversibility of conventional raster compositing by introducing Controllable Layer Decomposition (CLD), which yields fine-grained, user-guided separation of an image into multiple RGBA layers. The approach combines LayerDecompose-DiT (LD-DiT) with a Multi-Layer Conditional Adapter (MLCA) and a dual-condition classifier-free guidance (CFG) strategy, all trained within a Flow Matching diffusion framework and enhanced by Layer-Aware Rotary Position Embedding (LA-RoPE). A new PrismLayersPro-based benchmark and tailored metrics evaluate layer quality, transparency accuracy, and reconstruction fidelity, demonstrating superior controllability and visual coherence over baselines like LayerD. The generated layers are directly usable for downstream editing in design tools (e.g., PowerPoint), enabling reversible and flexible multi-layer workflows with practical real-world impact. CLD thus advances controllable multi-layer image generation and offers tangible benefits for graphic design pipelines through precise, box-guided layer decomposition and coherent cross-layer generation.

Abstract

This work presents Controllable Layer Decomposition (CLD), a method for achieving fine-grained and controllable multi-layer separation of raster images. In practical workflows, designers typically generate and edit each RGBA layer independently before compositing them into a final raster image. However, this process is irreversible: once composited, layer-level editing is no longer possible. Existing methods commonly rely on image matting and inpainting, but remain limited in controllability and segmentation precision. To address these challenges, we propose two key modules: LayerDecompose-DiT (LD-DiT), which decouples image elements into distinct layers and enables fine-grained control; and Multi-Layer Conditional Adapter (MLCA), which injects target image information into multi-layer tokens to achieve precise conditional generation. To enable a comprehensive evaluation, we build a new benchmark and introduce tailored evaluation metrics. Experimental results show that CLD consistently outperforms existing methods in both decomposition quality and controllability. Furthermore, the separated layers produced by CLD can be directly manipulated in commonly used design tools such as PowerPoint, highlighting its practical value and applicability in real-world creative workflows. Our project is available at https://monkek123king.github.io/CLD_page/.
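The irreversibility described in the abstract can be illustrated with ordinary alpha compositing. The sketch below is not part of CLD; it assumes straight (non-premultiplied) alpha, a black canvas, and the standard "over" operator, and the names `composite`, `stack_a`, and `stack_b` are hypothetical. It shows two different layer stacks flattening to the same raster, which is why the layers cannot be recovered from the composite by simple inversion.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float, float]  # (r, g, b, a), straight alpha, values in [0, 1]
Layer = List[List[Pixel]]                  # layer[y][x]


def composite(layers: Sequence[Layer]) -> List[List[Tuple[float, float, float]]]:
    """Flatten RGBA layers (bottom first) onto a black canvas with the 'over' operator."""
    height, width = len(layers[0]), len(layers[0][0])
    canvas = [[(0.0, 0.0, 0.0) for _ in range(width)] for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                r, g, b, a = layer[y][x]
                cr, cg, cb = canvas[y][x]
                # 'over': the new layer covers the canvas in proportion to its alpha.
                canvas[y][x] = (a * r + (1 - a) * cr,
                                a * g + (1 - a) * cg,
                                a * b + (1 - a) * cb)
    return canvas


RED = (1.0, 0.0, 0.0, 1.0)
BLUE = (0.0, 0.0, 1.0, 1.0)
CLEAR = (0.0, 0.0, 0.0, 0.0)

# Two different 1x1 layer stacks (illustrative data only).
stack_a = [[[RED]], [[CLEAR]]]  # red background, empty top layer
stack_b = [[[BLUE]], [[RED]]]   # blue background hidden by an opaque red layer

# Both stacks flatten to the same red pixel, so the mapping is many-to-one.
same_raster = composite(stack_a) == composite(stack_b)
```

Because compositing is many-to-one, recovering layers requires extra information or a learned prior, which is the role the abstract assigns to the generative model and its user-provided guidance.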

Paper Structure

This paper contains 23 sections, 11 equations, 13 figures, and 2 tables.

Figures (13)

  • Figure 1: Our framework utilizes a main backbone and a parallel control module for precise layer decomposition. (a) The overall CLD architecture, showing the LayerDecompose-DiT (LD-DiT) backbone responsible for generating the multi-layer latent. (b) The detailed structure of the Multi-Layer Conditional Adapter (MLCA). MLCA additively fuses features from the conditional image with the LD-DiT's hidden states, then performs hierarchical cropping based on the input bounding boxes to create a multi-layer guidance token sequence.
  • Figure 2: Overview of our adapted Multi-Layer RGBA image Decoder Architecture and Layer-Aware Rotary Position Encoding.
  • Figure 3: Unlike LayerD [suzuki2025layerd], which offers coarse separation and lacks user control, our method uses bounding boxes to guide a more fine-grained and controllable process. This results in better precision, visual quality, and hierarchical consistency in complex scenarios.
  • Figure 4: Ablation study on the impact of decoder choices and the CFG unconditional image condition.
  • Figure 5: Unlike segmentation (SAM2 [ravi2024sam]) and matting (ZIM [kim2025zim]) models, which fail on complex design images, our generative approach produces clearer, more coherent layering.
  • ...and 8 more figures
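The Figure 1(b) caption describes two steps in MLCA: additive fusion of conditional-image features with hidden states, followed by cropping by bounding box to form a multi-layer token sequence. The sketch below is a toy illustration of that data flow only, not the authors' implementation. It assumes the features lie on a 2D token grid, that boxes are given in token-grid units with exclusive upper bounds, and that each layer is cropped once; the names `fuse` and `crop_tokens` and the layer-index tag on each token are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[List[float]]]   # grid[y][x] is a C-dimensional feature vector
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in token-grid units, x1/y1 exclusive


def fuse(hidden: Grid, cond: Grid) -> Grid:
    """Element-wise addition of conditional-image features and hidden states."""
    return [[[h + c for h, c in zip(hvec, cvec)]
             for hvec, cvec in zip(hrow, crow)]
            for hrow, crow in zip(hidden, cond)]


def crop_tokens(grid: Grid, boxes: List[Box]) -> List[Tuple[int, List[float]]]:
    """Crop the grid once per layer box and flatten to (layer_index, token) pairs."""
    tokens = []
    for layer_idx, (x0, y0, x1, y1) in enumerate(boxes):
        for y in range(y0, y1):        # row-major order inside each box
            for x in range(x0, x1):
                tokens.append((layer_idx, grid[y][x]))
    return tokens


# Toy 2x3 token grid with one feature channel (illustrative values only).
hidden = [[[1.0], [2.0], [3.0]],
          [[4.0], [5.0], [6.0]]]
cond = [[[0.5], [0.5], [0.5]],
        [[0.5], [0.5], [0.5]]]
boxes = [(0, 0, 3, 2),  # layer 0: the whole canvas, e.g. a background
         (1, 0, 3, 1)]  # layer 1: a small element in the top row

sequence = crop_tokens(fuse(hidden, cond), boxes)
```

In this toy setup the sequence length is the sum of the box areas (6 + 2 = 8 tokens), so small elements contribute few tokens while a full-canvas layer contributes the whole grid.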