Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Shengming Yin, Zekai Zhang, Zecheng Tang, Kaiyuan Gao, Xiao Xu, Kun Yan, Jiahao Li, Yilei Chen, Yuxiang Chen, Heung-Yeung Shum, Lionel M. Ni, Jingren Zhou, Junyang Lin, Chenfei Wu

TL;DR

This work tackles the fundamental problem of editing consistency in raster images by introducing Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers. It introduces three innovations: RGBA-VAE to unify RGB/RGBA latent spaces, VLD-MMDiT to handle variable numbers of layers, and a multi-stage training regime to adapt pretrained generators to multilayer decompositions. To address data scarcity, it builds a PSD-derived dataset with annotations and captions to train the model for Text-to-Multi-RGBA and Image-to-Multi-RGBA tasks. Experiments demonstrate superior decomposition quality and showcase inherently consistent layer-based editing and multilayer synthesis, signaling a shift toward layer-aware image editing. The work also releases code and models, enabling practical adoption and further research in semantically disentangled image representations.

Abstract

Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released at https://github.com/QwenLM/Qwen-Image-Layered.
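
To make the idea of inherent editability concrete, the sketch below recombines a stack of RGBA layers with the standard Porter-Duff "over" operator and then moves one layer without touching the others. It is a minimal illustration, not the released implementation: it assumes straight (non-premultiplied) alpha with values in [0, 1], and the names `over`, `composite`, and `shift_right` are hypothetical helpers introduced here.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float, float]  # (R, G, B, A), each in [0, 1]
Layer = List[List[Pixel]]                  # rows of pixels

CLEAR: Pixel = (0.0, 0.0, 0.0, 0.0)       # fully transparent pixel


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Composite one straight-alpha pixel over another (Porter-Duff 'over')."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return CLEAR

    def blend(t: float, b: float) -> float:
        # Weight each colour by its effective coverage, then un-premultiply.
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def composite(layers: List[Layer]) -> Layer:
    """Blend layers bottom-to-top; layers[0] is assumed to be the background."""
    result = [row[:] for row in layers[0]]
    for layer in layers[1:]:
        for y, row in enumerate(layer):
            for x, pixel in enumerate(row):
                result[y][x] = over(pixel, result[y][x])
    return result


def shift_right(layer: Layer, dx: int) -> Layer:
    """Move a layer dx pixels to the right, filling vacated pixels with transparency."""
    width = len(layer[0])
    return [
        [row[x - dx] if 0 <= x - dx < width else CLEAR for x in range(width)]
        for row in layer
    ]
```

Repositioning an object then amounts to `composite([background, shift_right(foreground, 1)])`: only the edited layer changes, and the background pixels are reused unchanged, which is the consistency property the abstract describes.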

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Qwen-Image-Layered is capable of decomposing an input image into multiple semantically disentangled RGBA layers, thereby enabling inherent editability, where each layer can be independently manipulated without affecting other content.
  • Figure 4: Overview of Qwen-Image-Layered. Left: Illustration of our proposed VLD-MMDiT (Variable Layers Decomposition MMDiT), where the input RGB image and the target RGBA layers are both encoded by our proposed RGBA-VAE. During attention computation, these two sequences are concatenated along the sequence dimension, thereby enhancing inter-layer and intra-layer interactions. Right: Illustration of Layer3D RoPE, where a new layer dimension is introduced to support a variable number of layers.
  • Figure 5: Statistics of the processed multilayer image dataset. (a) Distribution of layer counts before and after merging. (b) Category distribution in the final dataset.
  • Figure 6: Qualitative comparison of Image-to-Multi-RGBA (I2L). The leftmost column shows the input image; the subsequent columns present the decomposed layers. Notably, LayerD [suzuki2025layerd] exhibits inpainting artifacts (Output Layer 1) and inaccurate segmentation (Output Layers 2 and 3), while our method produces high-quality, semantically disentangled layers, suitable for inherently consistent image editing.
  • Figure 7: Qualitative comparison of image editing. The leftmost column is the input image; prompts are listed above each row. Qwen-Image-Edit-2509 [wu2025qwen] struggles with resizing and repositioning, tasks inherently supported by Qwen-Image-Layered. Meanwhile, Qwen-Image-Edit-2509 introduces pixel-level shifts (last row), while Qwen-Image-Layered can ensure consistency by editing specific layers.
  • ...and 1 more figure
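
The Figure 4 caption describes two mechanisms: the input-image tokens and the target-layer tokens are concatenated along the sequence dimension, and a layer dimension is added to the positional encoding so that the number of layers can vary. The sketch below shows only the bookkeeping this implies, namely one (layer, row, column) index per token. It is an assumption-laden illustration: the function names are hypothetical, and giving the conditioning image the layer index -1 is a choice made here for clarity, not something the caption states.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (layer index, row, column)


def layer_positions(layer_idx: int, height: int, width: int) -> List[Position]:
    """Position ids for one layer's latent tokens, in row-major order."""
    return [(layer_idx, r, c) for r in range(height) for c in range(width)]


def build_sequence_positions(
    num_layers: int,
    height: int,
    width: int,
    condition_layer_idx: int = -1,  # assumed index for the input RGB image
) -> List[Position]:
    """Concatenate conditioning-image tokens and target-layer tokens along the sequence axis."""
    positions = layer_positions(condition_layer_idx, height, width)
    for k in range(num_layers):
        positions += layer_positions(k, height, width)
    return positions
```

Because the layer index is just one more coordinate, changing `num_layers` only lengthens the sequence; the per-token layout within each layer stays the same, which is one plausible reading of how a variable number of layers is supported.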