Text Guided Image Editing with Automatic Concept Locating and Forgetting

Jia Li, Lijie Hu, Zhixian He, Jingfeng Zhang, Tianhang Zheng, Di Wang

TL;DR

The paper addresses semantic misalignment in text-guided image editing with diffusion models by automatically locating target concepts in the input image through scene descriptions and dependency parsing, then applying negative forgetting during denoising to realign edits with the textual instruction. The proposed Locate and Forget (LaF) framework comprises a two-stage process: (1) align the input prompt with the image context to identify edit targets, and (2) perform selective forgetting via negative guidance during diffusion to suppress unintended content. Empirical results on TedBench and MagicBrush show substantial gains in alignment (CLIP-T) and realism (IS), with ablations highlighting an optimal forgetting strength; a human study further corroborates LaF’s superior editing quality. The work advances controllable, language-driven image editing by reducing reliance on manual localization and improving semantic coherence across diverse datasets, while also acknowledging limitations in quantifying certain numeric edits and broader societal implications.
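The "negative guidance" and "forgetting strength" mentioned above can be illustrated with a small sketch. This summary does not give the paper's update rule, so the formula below is an assumption: it extends the standard classifier-free guidance combination with a subtracted term for the concept to forget, weighted by the forgetting guidance $\eta$. The function name `guided_noise` and the `guidance_scale` parameter are hypothetical, and plain Python lists stand in for the noise tensors.

```python
from typing import List, Sequence


def guided_noise(
    eps_uncond: Sequence[float],
    eps_target: Sequence[float],
    eps_forget: Sequence[float],
    guidance_scale: float = 7.5,
    eta: float = 1.0,
) -> List[float]:
    """Combine three noise predictions elementwise (illustrative assumption).

    eps_uncond: prediction with an empty prompt
    eps_target: prediction conditioned on the target prompt
    eps_forget: prediction conditioned on the located concept to forget
    """
    if not (len(eps_uncond) == len(eps_target) == len(eps_forget)):
        raise ValueError("all noise predictions must have the same length")
    return [
        # Move toward the target prompt, and away from the forgotten concept.
        u + guidance_scale * (t - u) - eta * (f - u)
        for u, t, f in zip(eps_uncond, eps_target, eps_forget)
    ]


if __name__ == "__main__":
    # eta = 0 reduces to ordinary classifier-free guidance.
    print(guided_noise([0.0, 1.0], [1.0, 1.0], [0.0, 2.0], guidance_scale=2.0, eta=0.0))
    print(guided_noise([0.0, 1.0], [1.0, 1.0], [0.0, 2.0], guidance_scale=2.0, eta=0.5))
```

Under this assumed form, setting `eta` to zero disables forgetting entirely, while a very large value lets the negative term dominate the target prompt, which is one way to read the reported existence of an optimal forgetting strength.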

Abstract

Text-guided image-to-image diffusion models have driven significant progress in image editing. However, a persistent challenge remains: seamlessly incorporating objects into images from textual instructions alone, without extra user-provided guidance. Text and images are inherently distinct modalities, which makes it difficult to capture the semantic intent conveyed through language and to translate it accurately into the desired visual modifications. As a result, text-guided image editing models often produce outputs that retain residual attributes of the original objects and do not fully align with human expectations. Addressing this requires the model to understand the image content well enough to avoid a disconnect between the textual editing prompt and the modifications actually made to the image. We propose Locate and Forget (LaF), a method that locates potential target concepts in the image by comparing the syntactic trees of the target prompt and of scene descriptions of the input image, and then forgets the cues to their existence in the generated image. Compared with the baselines, our method performs better on text-guided image editing tasks, both qualitatively and quantitatively.
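The locating step can be sketched in a few lines. This is a simplified stand-in, not the authors' procedure: it assumes a dependency parser has already reduced the scene description and the target prompt to a mapping from dependency role to a head noun and its modifiers. The `Parse` type, the function `locate_forget_concepts`, and the comparison rules are all hypothetical. The example uses the red car and yellow bus case from Figure 1, with an invented scene description.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical simplified parse: dependency role -> (head noun, modifiers).
# A real system would obtain this from a dependency parser.
Parse = Dict[str, Tuple[str, Set[str]]]


def locate_forget_concepts(scene: Parse, target: Parse) -> List[str]:
    """Return scene concepts that conflict with the target prompt."""
    forget: List[str] = []
    for role, (scene_head, scene_mods) in scene.items():
        if role not in target:
            # The prompt says nothing about this slot, so keep it.
            continue
        target_head, target_mods = target[role]
        if scene_head != target_head:
            # The object itself is replaced: forget the whole phrase.
            forget.append(" ".join(sorted(scene_mods) + [scene_head]))
        elif target_mods:
            # Same object with attributes given in the prompt:
            # forget only the scene attributes the prompt does not mention.
            forget.extend(sorted(scene_mods - target_mods))
    return forget


if __name__ == "__main__":
    # Assumed scene description: "a red car on a street".
    scene = {"ROOT": ("car", {"red"}), "pobj": ("street", set())}
    # Target prompt: "an image of a yellow bus", reduced to its main noun.
    target = {"ROOT": ("bus", {"yellow"})}
    print(locate_forget_concepts(scene, target))  # ['red car']
```

Comparing by role rather than taking a plain set difference keeps unrelated scene elements (here, the street) out of the forget list, so only the concept the prompt replaces is suppressed.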

Paper Structure (21 sections, 7 equations, 9 figures, 2 tables, 1 algorithm)

Figures (9)

  • Figure 1: The original image shows a red car. When the input text instruction is "an image of a yellow bus", Stable Diffusion changes the color but preserves the original shape. By analyzing the scene description of the image, the concepts that the user intends to edit are located and forgotten during the denoising steps, yielding an improved output.
  • Figure 2: The Locate and Forget (LaF) framework. LaF consists of two parts: 1) Alignment of the input text prompt with the visual scene information: by comparing the textual instruction with the scene description of the image's contents, LaF identifies the specific concepts and attributes in the visual scene that need to be edited. 2) Selective forgetting during the diffusion process: during the denoising steps, the identified forgettable elements serve as negative guidance, which lets the model selectively forget the influence of visual elements that are not aligned with the user's intent.
  • Figure 3: Visual comparisons between Hive, IP2P, SD and our method on the MagicBrush dataset. The red annotations indicate the visual concepts in the original image that need to be edited, while the blue annotations represent the new visual elements that should be introduced based on the provided textual prompt.
  • Figure 4: Visual editing results under varying forgetting guidance $\eta$. The example images are, respectively, "Skillet filled with salami, broccoli and other vegetables." and "A tennis ball on a tennis court."
  • Figure 5: Impact of forgetting guidance values on CLIP-T across different datasets.
  • ...and 4 more figures