LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning

Fengyi Fu, Mengqi Huang, Lei Zhang, Zhendong Mao

TL;DR

LayerEdit introduces a training-free, multi-layer approach to text-driven multi-object editing that explicitly models inter-object conflicts. It decomposes the scene into object layers using conflict-aware decomposition, applies per-object textual and geometric edits via an extended multi-layer diffusion framework, and fuses layers through transparency-guided fusion to maintain structural coherence. By leveraging an attention-aware IoU mechanism and a time-dependent region removal strategy, LayerEdit achieves improved intra-object controllability and inter-object coherence, validated on OIR-bench and LoMOE-bench with significant gains in CLIP-T, LPIPS, and FID. The method is plug-and-play on modern backbones like SDXL and demonstrates strong geometric editing capabilities, enabling region-unconstrained manipulations and occlusion handling while maintaining global image quality.

Abstract

Text-driven multi-object image editing, which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We therefore propose LayerEdit, a novel training-free multi-layer disentangled editing framework that, for the first time, enables conflict-free object-layered editing through precise object-layered decomposition and coherent fusion. Specifically, LayerEdit introduces a novel "decompose-editing-fusion" framework consisting of: (1) a Conflict-aware Layer Decomposition module, which uses an attention-aware IoU scheme and time-dependent region removing to enhance conflict awareness and suppression during layer decomposition; (2) an Object-layered Editing module, which establishes coordinated intra-layer text guidance and cross-layer geometric mapping to achieve disentangled semantic and structural modifications; and (3) a Transparency-guided Layer Fusion module, which facilitates structure-coherent inter-object layer fusion through precise transparency guidance learning. Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios. Code is available at: https://github.com/fufy1024/LayerEdit.
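
The abstract names an attention-aware IoU scheme for detecting conflict regions but does not define it here. As a rough illustration of the idea only, the sketch below computes a generic soft IoU between two per-object attention maps and flags pixels where both maps respond strongly. The function names `soft_iou` and `conflict_mask`, the 0.5 threshold, and the toy maps are assumptions made for this example, not the paper's formulation.

```python
from typing import List

Map = List[List[float]]  # a 2D attention map with values in [0, 1]


def soft_iou(a: Map, b: Map) -> float:
    """Generic soft IoU: sum of element-wise minima over sum of element-wise maxima."""
    inter = sum(min(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(max(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union > 0 else 0.0


def conflict_mask(a: Map, b: Map, thresh: float = 0.5) -> List[List[bool]]:
    """Flag pixels where both maps exceed `thresh` (a hypothetical conflict rule)."""
    return [[x > thresh and y > thresh for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


# Toy attention maps for two object tokens (made-up values).
cat = [[0.9, 0.8, 0.1],
       [0.7, 0.6, 0.0]]
dog = [[0.0, 0.7, 0.9],
       [0.1, 0.6, 0.8]]

print(soft_iou(cat, dog))       # overlap score between the two maps
print(conflict_mask(cat, dog))  # True only in the shared middle column
```

A high overlap score would suggest that the two objects compete for the same region, which is the situation the decomposition module is described as suppressing.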

Paper Structure

This paper contains 26 sections, 14 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Illustration of motivation. (a) The existing paradigm suffers from inaccurate disentanglement across conflict regions, resulting in (a-1) inter-object editing leakage or (a-2) intra-object editing artifacts. (b) Our multi-layer disentangled editing paradigm models conflict-aware object-layered decomposition and structure-coherent fusion, yielding conflict-free and accurate multi-object editing.
  • Figure 2: Visualization of cross-attention maps obtained from SDXL, associated with different tokens. The key observation is that the model exhibits text-image attention misalignment in both (a) semantic and (b) spatial conflict regions.
  • Figure 3: Overview of LayerEdit, consisting of: 1) Conflict-aware Layer Decomposition, which precisely decomposes object layers by identifying and constraining conflict regions; 2) Object-layered Editing, which establishes intra-layer text-guided editing (geometric editing is detailed in a separate figure); 3) Transparency-guided Layer Fusion, which enables structure-coherent fusion with transparency learning.
  • Figure 4: Visualization of attention features from earlier to later timesteps, driven by time-dependent region removing.
  • Figure 5: Diagram of the centroid-aligned mapping principle. $(c_h, c_w)$ is the object centroid. Displacement $(\Delta h, \Delta w)$ and scale $s$ are additional input controls for moving/resizing.
  • ...and 11 more figures
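
The Figure 5 caption describes moving and resizing an object relative to its centroid $(c_h, c_w)$, with displacement $(\Delta h, \Delta w)$ and scale $s$ as controls. One plausible reading is a simple affine map about the centroid, sketched below. The exact mapping used in the paper is not given in this listing, so the formula and the helper names `centroid` and `map_point` are assumptions for illustration.

```python
from typing import List, Tuple


def centroid(mask: List[List[int]]) -> Tuple[float, float]:
    """Mean (row, col) position of the nonzero pixels in a binary object mask."""
    pts = [(h, w) for h, row in enumerate(mask) for w, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask contains no object pixels")
    n = len(pts)
    return sum(h for h, _ in pts) / n, sum(w for _, w in pts) / n


def map_point(h: float, w: float, c: Tuple[float, float],
              dh: float, dw: float, s: float) -> Tuple[float, float]:
    """Assumed mapping: scale the offset from the centroid by s, then shift by (dh, dw)."""
    c_h, c_w = c
    return c_h + dh + s * (h - c_h), c_w + dw + s * (w - c_w)


# A 2x2 object in the middle of a 4x4 mask.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]

c = centroid(mask)                                # (1.5, 1.5)
print(map_point(1, 1, c, dh=1.0, dw=0.0, s=2.0))  # corner pixel -> (1.5, 0.5)
print(map_point(*c, c, dh=1.0, dw=0.0, s=2.0))    # centroid -> (2.5, 1.5)
```

Under this assumed form, the centroid itself moves only by the displacement, while every other pixel keeps its direction from the centroid and has its distance multiplied by $s$.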