DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang

TL;DR

DesignEdit tackles precise, spatial-aware image editing by recasting each edit as multi-layered latent decomposition followed by latent fusion within a diffusion-based pipeline. It introduces a key-masking self-attention mechanism for reliable background inpainting and an artifact suppression scheme, and it uses GPT-4V to plan layer-wise editing instructions. The authors report that the approach surpasses recent spatial editing methods, including Self-Guidance and DiffEditor, while requiring no training or finetuning, and that it supports a broad set of tasks including object movement, resizing, removal, and cross-image composition. Overall, the work offers a modular framework aimed at improving editing accuracy and flexibility for design images.

Abstract

Precise image editing has recently attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we adopt the concept of layers from the design domain to manipulate objects flexibly with various operations. The key insight is to transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion. First, we segment the latent representations of the source images into multiple layers, which include several object layers and one incomplete background layer that requires reliable inpainting. To avoid extra tuning, we further explore the inherent inpainting ability of the self-attention mechanism. We introduce a key-masking self-attention scheme that propagates the surrounding context information into the masked region while mitigating the masked region's impact on the regions outside the mask. Second, we propose an instruction-guided latent fusion that pastes the multi-layered latent representations onto a canvas latent. We also introduce an artifact suppression scheme in the latent space to enhance the inpainting quality. Owing to the inherent modularity of such multi-layered representations, we can achieve accurate image editing, and we demonstrate that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor. Last, we show that our approach is a unified framework that supports more than six different accurate image editing tasks.
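
The key-masking idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a single-head, pure-Python toy that assumes the masking is applied by hiding masked keys from every query before the softmax. The function name `key_masked_attention` and the parameter `key_mask` are hypothetical.

```python
import math


def key_masked_attention(queries, keys, values, key_mask):
    """Single-head attention in which masked keys are hidden from every query.

    queries, keys, values: lists of N feature vectors (lists of floats).
    key_mask: list of N booleans, True where the token lies inside the
              region to be removed (hypothetical name, not from the paper).
    """
    if all(key_mask):
        raise ValueError("at least one key must lie outside the mask")
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Scaled dot-product logits; masked keys get -inf so that the
        # softmax assigns them exactly zero weight.
        logits = [
            -math.inf if masked else scale * sum(a * b for a, b in zip(q, k))
            for k, masked in zip(keys, key_mask)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of value vectors, one output vector per query.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Under this assumption, tokens inside the mask still act as queries, so they are filled from the surrounding context, while their own keys and values contribute nothing to any output, including the outputs for tokens outside the mask.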

Paper Structure

This paper contains 13 sections, 6 equations, 20 figures, and 2 tables.

Figures (20)

  • Figure 1: Examples of visual design image editing. Our approach facilitates a range of image editing operations with a training-free and unified framework to achieve accurate spatial-aware editing of the design image. Our approach is able to manipulate different objects simultaneously, as well as implement various operations at the same time. All results are produced using one diffusion denoising process.
  • Figure 2: Comparison of our method against Self-Guidance and DiffEditor. We report the win rates for image quality and edit accuracy in (a). For each comparison, we select 10 examples with multiple operations such as movement and resizing. Users were asked to vote on two aspects: image quality and edit accuracy. The "Draw" option indicates that the two results are equally good. We collected answers from 73 users, for a total of 1460 votes per metric.
  • Figure 3: Illustration of the overall framework of our approach. During the multi-layered decomposition stage, given a user's editing instruction and the source image, we first use GPT-4V to perform instruction planning, generating a set of detailed layer-wise editing instructions. Then, we segment the source image into multiple image layers: a background layer, which requires additional inpainting implemented by a novel key-masking self-attention scheme, and one object layer for each object to manipulate. During the multi-layered fusion stage, we follow the layer order and the layer-wise instructions to paste the layers sequentially onto the canvas in latent space. We further apply multiple denoising steps to harmonize the fused multi-layered latent representations. Additionally, we perform artifact suppression to improve the background inpainting quality.
  • Figure 4: Key-Masking Self-Attention Mechanism at timestep $t$. The figure shows the diagram for the removal latent ${\bf Z}_t^\mathcal{S}$ at timestep $t$. The surroundings of pixel features are kept by the source latent ${\bf Z}_t^\mathcal{S}$. ${\bf M}_\mathsf{remove}$ and ${\bf M}_\mathsf{refine}$ are utilized on key features to reduce attention within the mask.
  • Figure 5: Illustration of the Key-Masking Self-Attention Mechanism. (a) shows that regions inside the mask query only from the regions outside the mask, which are copied from the source latent to complete the information. (b) presents the output heatmaps changing over time from the source and removal latent. The maps come from the first self-attention block at a resolution of $64 \times 64$.
  • ...and 15 more figures
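
The fusion stage described in the Figure 3 caption, where layers are pasted onto a canvas in order, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: latents are reduced to single-channel 2-D grids, each layer-wise instruction is reduced to an integer translation `(dy, dx)`, and resizing, denoising, and artifact suppression are omitted. The name `fuse_layers` and the tuple layout are hypothetical.

```python
def fuse_layers(background, layers):
    """Paste object layers onto a copy of the background, in list order.

    background: H x W grid (list of lists) standing in for the canvas latent.
    layers: iterable of (latent, mask, (dy, dx)) tuples, where latent and mask
            are H x W grids and (dy, dx) is the translation to apply.
    Later layers overwrite earlier ones where they overlap.
    """
    height, width = len(background), len(background[0])
    canvas = [row[:] for row in background]  # leave the input untouched
    for latent, mask, (dy, dx) in layers:
        for y in range(height):
            for x in range(width):
                if not mask[y][x]:
                    continue
                ty, tx = y + dy, x + dx
                # Drop any part of the object that moves off the canvas.
                if 0 <= ty < height and 0 <= tx < width:
                    canvas[ty][tx] = latent[y][x]
    return canvas


# Usage: move a one-cell "object" from (0, 0) to (1, 2) on a 3 x 4 canvas.
background = [[0, 0, 0, 0] for _ in range(3)]
obj = [[7, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
obj_mask = [[True, False, False, False], [False] * 4, [False] * 4]
fused = fuse_layers(background, [(obj, obj_mask, (1, 2))])
```

Processing the layers in list order is what makes the result depend on layer order: an object listed later covers an earlier one wherever their pasted regions overlap.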