S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

Xudong Liu, Zikun Chen, Ruowei Jiang, Ziyi Wu, Kejia Yin, Han Zhao, Parham Aarabi, Igor Gilitschenski

TL;DR

The paper tackles the challenge of precise, identity-preserving text-guided image editing with diffusion models, where naïve editing often distorts identity or entangles attributes. It introduces S$^2$Edit, a two-stage framework that learns an identity token [I] through identity-focused fine-tuning and enforces semantic orthogonality and spatially constrained cross-attention to localize the token's influence. It further extends to compositional editing by learning an attribute token [A] and composing prompts to transfer attributes such as makeup while maintaining source identity, achieving superior qualitative and quantitative results on diverse datasets. The work demonstrates strong generalization to non-face domains and highlights practical impact for controlled, fine-grained edits, while noting limitations such as dependency on a source prompt and ethical considerations around misuse.

Abstract

Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by text, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions in which identity information and high-frequency details are lost during the editing process, or irrelevant image regions are altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity, using the learned identity token, which is semantically disentangled and spatially focused. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.
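
The orthogonality constraint mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact form of the loss, so the version below is an assumption: it penalizes the mean squared cosine similarity between the identity token embedding and each attribute embedding, which is zero exactly when the identity embedding is orthogonal to all of them. The function names and the toy two-dimensional vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(identity_emb, attribute_embs):
    """Mean squared cosine similarity between the identity embedding and
    each attribute embedding (assumed form, not the paper's exact loss).

    The value is 0.0 when the identity embedding is orthogonal to every
    attribute embedding and grows towards 1.0 as they align.
    """
    sims = [cosine_similarity(identity_emb, a) for a in attribute_embs]
    return sum(s * s for s in sims) / len(sims)


# Toy embeddings: the identity direction and two attribute directions.
identity = [1.0, 0.0]
attributes = [[0.0, 1.0], [1.0, 1.0]]
penalty = orthogonality_penalty(identity, attributes)
```

In a training loop, a term like this would be added to the usual diffusion fine-tuning objective so that gradient updates push the identity embedding away from the attribute directions. How the paper weights the term is not stated in this section.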

Paper Structure

This paper contains 12 sections, 3 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Impact of prompts on editing results. The results are guided by the prompts listed above.
  • Figure 2: S$^2$Edit overview. Left: Given a source image and a text prompt, we insert a learnable token [I] into the text prompt and fine-tune a pre-trained text-to-image diffusion model to learn the identity information. To obtain a disentangled identity token, we apply an orthogonality constraint in the text embedding space via a semantic loss $L_{semantic}$ and force [I] to only represent the object of interest with masked cross-attention. Right: With [I] learned, we freeze the fine-tuned model and perform Null-text Inversion (Mokady et al., 2023) to get an initial noise map, then denoise it conditioned on the target prompt to generate the editing result.
  • Figure 3: Qualitative comparison of text-guided image editing in the face domain. The target prompts are listed under each row. S$^2$Edit outperforms state-of-the-art methods significantly with accurate and faithful edits that align well with the editing prompts while preserving identity information. Prompt details are provided in Appendix A.3.
  • Figure 4: Fine-grained editing results of S$^2$Edit on the same image for various attributes. Full prompts used are provided in Appendix A.3.
  • Figure 5: Editing results of S$^2$Edit on cat (left) and church images (right).
  • ...and 4 more figures
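
The masked cross-attention described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The attention map is represented as a plain list with one row per spatial position and one weight per text token. The object mask is a list of 0/1 values per position. The weight of the identity token [I] is set to zero outside the mask, with no renormalization. The function name, the data layout, and the choice not to renormalize are all hypothetical.

```python
def mask_identity_attention(attn, mask, identity_index):
    """Zero the identity token's attention weight outside the object mask.

    attn: list of rows, one per spatial position; each row holds one
          attention weight per text token (assumed layout).
    mask: list of 0/1 flags, one per spatial position; 1 marks the object.
    identity_index: position of the identity token [I] in the prompt.

    Returns a new attention map; the input is left unchanged.
    """
    masked = []
    for weights, inside in zip(attn, mask):
        row = list(weights)  # copy so the caller's map is not mutated
        if not inside:
            row[identity_index] = 0.0
        masked.append(row)
    return masked


# Two spatial positions, two tokens; token 1 plays the role of [I].
attention = [[0.5, 0.5], [0.2, 0.8]]
object_mask = [1, 0]  # only the first position lies on the object
result = mask_identity_attention(attention, object_mask, identity_index=1)
```

Restricting [I] in this way keeps the identity token from influencing background regions, which matches the stated goal of forcing [I] to represent only the object of interest.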