Forgedit: Text Guided Image Editing via Learning and Forgetting

Shiwen Zhang, Shuai Xiao, Weilin Huang

TL;DR

Forgedit tackles the challenge of text-guided image editing with only an original image and a target prompt by coupling a vision-language joint fine-tuning stage with a novel editing mechanism. It introduces vector projection to combine a learned source embedding with the target prompt embedding and reveals a UNet-based forgetting strategy that selectively preserves or discards encoder/decoder parameters to mitigate overfitting during sampling. The method achieves state-of-the-art results on TEdBench, outperforming Imagic with Imagen in CLIP and LPIPS metrics, and runs reconstruction in about 30 seconds on a single A100 GPU, enabling fast, flexible editing and potential applications in visual storytelling. The framework is compatible with Stable Diffusion 1.4 and can extend to other fine-tuning-based editing pipelines, highlighting its practical impact for robust and controllable image editing. Forgedit thus provides a fast, data-efficient path to high-fidelity, non-rigid edits while addressing key overfitting challenges in diffusion-model fine-tuning.

Abstract

Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task. It requires an editing model to estimate by itself which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of the original image. In this paper, we design a novel text-guided image editing method named Forgedit. First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds, much faster than the previous SOTA and with much less overfitting. Then we propose a novel vector projection mechanism in the text embedding space of Diffusion Models, which is capable of controlling the identity similarity and the editing strength separately. Finally, we discover a general property of the UNet in Diffusion Models, i.e., the UNet encoder learns space and structure, while the UNet decoder learns appearance and identity. With this property, we design forgetting mechanisms to successfully tackle the fatal and inevitable overfitting issues when fine-tuning Diffusion Models on one image, thus significantly boosting the editing capability of Diffusion Models. Our method, Forgedit, built on Stable Diffusion, achieves new state-of-the-art results on the challenging text-guided image editing benchmark TEdBench, surpassing previous SOTA methods such as Imagic with Imagen in terms of both CLIP score and LPIPS score. Code is available at https://github.com/witcherofresearch/Forgedit
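
The forgetting idea in the abstract can be illustrated with a minimal sketch. It assumes that "forgetting" a part of the UNet means restoring that part's pretrained parameters while keeping the fine-tuned values everywhere else; the abstract does not spell out the exact rule. The function name `forget`, the prefix constants, and the parameter-name prefixes `down_blocks.` (encoder) and `up_blocks.` (decoder) are assumptions made for illustration, and plain floats stand in for weight tensors.

```python
from typing import Dict, Iterable

# Stand-in for a model state dict: parameter name -> value.
# Real UNet parameters would be tensors; floats keep the sketch self-contained.
Params = Dict[str, float]

# Hypothetical name prefixes for the two halves of the UNet (assumption).
ENCODER_PREFIXES = ("down_blocks.",)
DECODER_PREFIXES = ("up_blocks.",)


def forget(pretrained: Params, finetuned: Params,
           forget_prefixes: Iterable[str]) -> Params:
    """Return parameters for sampling.

    Parameters whose names start with one of `forget_prefixes` are reset to
    their pretrained values ("forgotten"); all others keep fine-tuned values.
    """
    prefixes = tuple(forget_prefixes)
    merged: Params = {}
    for name, tuned_value in finetuned.items():
        if name.startswith(prefixes):
            merged[name] = pretrained[name]   # discard what was learned here
        else:
            merged[name] = tuned_value        # keep what was learned here
    return merged


# Toy example: every fine-tuned weight moved from 0.0 to 1.0.
pretrained = {"down_blocks.0.w": 0.0, "mid_block.w": 0.0, "up_blocks.0.w": 0.0}
finetuned = {name: 1.0 for name in pretrained}

# Forget the encoder, keep the decoder.
structure_edit_params = forget(pretrained, finetuned, ENCODER_PREFIXES)
```

Under the property stated in the abstract, forgetting the encoder while keeping the decoder would suit edits that change pose or layout but should keep identity; passing `DECODER_PREFIXES` instead would be the mirror-image choice for appearance edits.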

Paper Structure (17 sections, 9 equations, 13 figures, 1 table)

This paper contains 17 sections, 9 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Forgedit can be used for consistent and controllable keyframe generation for visual storytelling and movie generation, given one input image and target prompts. We list several samples with different random seeds for each target prompt. We demonstrate that Forgedit is capable of controlling multiple characters performing various actions in different scenes. Forgedit can also control each character separately. The forgetting strategy on the UNet's encoder, combined with vector subtraction, gives high flexibility and a high success rate when changing spatial structures and actions, while appearance and identity are preserved by retaining the UNet's decoder.
  • Figure 2: Overall framework of our Forgedit, consisting of a vision-language joint fine-tuning stage and an editing stage. We use BLIP to generate a text description of an original image, and compute an embedding of the source text $e_{src}$ using a CLIP text encoder. The source embedding $e_{src}$ is then jointly optimized with UNet using different learning rates for text embedding and UNet, where the deep layers of UNet are frozen. During the editing process, we merge the source embedding $e_{src}$ and the target embedding $e_{tgt}$ with vector subtraction or projection to get a final text embedding $e$. With our forgetting strategies applied to UNet, we utilize DDIM sampling to get the final edited image.
  • Figure 3: We demonstrate vector subtraction and vector projection to merge $e_{src}$ and $e_{tgt}$. Vector subtraction could lead to an inconsistent appearance of the object being edited, since it cannot directly control the importance of $e_{src}$. Vector projection decomposes $e_{tgt}$ into $r e_{src}$ along $e_{src}$ and $e_{edit}$ orthogonal to $e_{src}$. We can directly control the scales of $e_{src}$ and $e_{edit}$ by summation.
  • Figure 4: The encoder of the UNet learns features related to pose, angle, structure and position. The decoder learns features related to appearance and texture. Thus we design a forgetting strategy according to the editing target.
  • Figure 5: Forgedit Workflow.
  • ...and 8 more figures
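
The vector projection described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: embeddings are treated as flat lists of floats, the coefficient $r$ is taken to be the standard projection coefficient $\langle e_{src}, e_{tgt} \rangle / \lVert e_{src} \rVert^2$, and the scale names `alpha` and `beta` in the final sum are assumptions introduced here for the two weights the caption says can be controlled.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(e_src: Vector, e_tgt: Vector) -> Tuple[float, Vector]:
    """Decompose e_tgt = r * e_src + e_edit with e_edit orthogonal to e_src."""
    r = dot(e_src, e_tgt) / dot(e_src, e_src)
    e_edit = [t - r * s for s, t in zip(e_src, e_tgt)]
    return r, e_edit


def merge_by_projection(e_src: Vector, e_tgt: Vector,
                        alpha: float, beta: float) -> Vector:
    """Weighted sum alpha * e_src + beta * e_edit (alpha, beta are assumed names)."""
    _, e_edit = project(e_src, e_tgt)
    return [alpha * s + beta * d for s, d in zip(e_src, e_edit)]


# Toy 2-D example standing in for real text embeddings.
e_src = [1.0, 0.0]
e_tgt = [3.0, 4.0]
r, e_edit = project(e_src, e_tgt)
e = merge_by_projection(e_src, e_tgt, alpha=0.8, beta=1.5)
```

Because `e_edit` has no component along `e_src`, raising `beta` strengthens the edit without changing how much of the source embedding is present, and `alpha` controls the source contribution independently. This is the separation that vector subtraction, per the caption, cannot provide directly.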