An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control

Aosong Feng, Weikang Qiu, Jinbin Bai, Xiao Zhang, Zhen Dong, Kaicheng Zhou, Rex Ying, Leandros Tassiulas

TL;DR

The paper tackles flexible image editing with diffusion models by introducing D-Edit, a framework that achieves item-level control through disentangled cross-attention and unique item prompts. It builds item-prompt associations via a two-step finetuning process and links prompts to image regions to enable precise text-, image-, and mask-based edits, as well as item removal, all within a single architecture. Empirical results across four editing modes show state-of-the-art performance in fidelity and harmony, without requiring captions for the original image, and demonstrate strong generalization across SD and SDXL. The approach is poised to improve practical image editing workflows by offering fine-grained, controllable, and efficient edits that preserve original content while enabling targeted modifications.

Abstract

Building on the success of text-to-image diffusion models (DPMs), image editing is an important application to enable human interaction with AI-generated content. Among various editing methods, editing within the prompt space has gained more attention due to its capacity and simplicity in controlling semantics. However, since diffusion models are commonly pretrained on descriptive text captions, directly editing words in text prompts usually leads to completely different generated images, violating the requirements for image editing. On the other hand, existing editing methods usually introduce spatial masks to preserve the identity of unedited regions, but these masks are usually ignored by DPMs and therefore lead to inharmonious editing results. Targeting these two challenges, in this work, we propose to disentangle the comprehensive image-prompt interaction into several item-prompt interactions, with each item linked to a special learned prompt. The resulting framework, named D-Edit, is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations. Versatile image editing can then be applied to specific items by manipulating the corresponding prompts. We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal, covering most types of editing applications, all within a single unified framework. Notably, D-Edit is the first framework that can (1) achieve item editing through mask editing and (2) combine image- and text-based editing. We demonstrate the quality and versatility of the editing results for a diverse collection of images through both qualitative and quantitative evaluations.

Paper Structure (19 sections, 4 equations, 15 figures, 4 tables)

This paper contains 19 sections, 4 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: The editing pipeline of D-Edit. The user first uploads an image, which is segmented into several items. After finetuning DPMs, the user can perform various types of control, including (a) replacing the model with another using a text prompt; (b) refining imperfect details caused by segmentation; (c) moving the bags to the ground; (d) replacing the handbag with another one from a reference image; (e) reshaping the handbag; (f) resizing the model and handbag; (g) removing the background.
  • Figure 2: Comparison of conventional full cross-attention and grouped cross-attention. Query, key, and value are shown as one-dimensional vectors. For grouped cross-attention, each item (corresponding to certain pixels/patches) only attends to the text prompt (two tokens) assigned to it.
  • Figure 3: Embedding layer in the text encoder. New tokens are inserted with random initialization.
  • Figure 4: Operations needed for different types of image editing. Each colored item has a unique prompt p.
  • Figure 5: Text-guided editing. D-Edit lets the user select any segmented item and edit it using a text prompt.
  • ...and 10 more figures
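
The grouped cross-attention described in the Figure 2 caption can be illustrated with a small sketch. This is a minimal pure-Python toy, not the authors' implementation: the function name grouped_cross_attention, the pixel_item and token_item label lists, and the toy vectors are all assumptions introduced here for illustration. Each pixel/patch query is restricted to the prompt tokens assigned to the same item, so changing one item's prompt cannot alter the attention output of pixels that belong to another item.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_cross_attention(queries, keys, values, pixel_item, token_item):
    """Cross-attention in which each pixel only attends to its own item's tokens.

    queries:    one vector per pixel/patch
    keys:       one vector per prompt token
    values:     one vector per prompt token
    pixel_item: item id of each pixel (hypothetical segmentation labels)
    token_item: item id of each prompt token (hypothetical assignment)
    """
    dim = len(queries[0])
    outputs = []
    for query, item in zip(queries, pixel_item):
        # Keep only the tokens assigned to this pixel's item.
        idx = [j for j, owner in enumerate(token_item) if owner == item]
        if not idx:
            raise ValueError(f"item {item} has no prompt tokens")
        # Scaled dot-product scores against the allowed tokens only.
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in idx
        ]
        weights = softmax(scores)
        # Weighted sum of the allowed tokens' value vectors.
        out = [
            sum(w * values[j][d] for w, j in zip(weights, idx))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy setup: three pixels, two items, two prompt tokens per item.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pixel_item = [0, 0, 1]     # pixels 0 and 1 belong to item 0, pixel 2 to item 1
token_item = [0, 0, 1, 1]  # tokens 0-1 describe item 0, tokens 2-3 item 1

out = grouped_cross_attention(queries, keys, values, pixel_item, token_item)
```

In this toy setup, replacing the value vectors of tokens 2 and 3 (item 1's prompt) changes only the output for pixel 2. The outputs for pixels 0 and 1 stay the same, which is the disentanglement property the caption describes.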