FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning

Zhi Chen, Zecheng Zhao, Yadan Luo, Zi Huang

TL;DR

FastEdit is a fast, text-guided single-image editing framework that bypasses text-embedding optimization by using an image-to-image diffusion model conditioned on CLIP-derived image features and target text features. A semantic-aware diffusion fine-tuning strategy selects time-step sets based on the semantic discrepancy between the input image and the target text, reducing fine-tuning to 50 iterations, while LoRA reduces the trainable parameters to 0.37% of the original model. Experiments on diverse image-text pairs and on TEdBench show competitive or superior editing quality at much lower latency (about 17 seconds per image, versus about 7 minutes for prior methods). Supported edits include content addition, style transfer, background replacement, and pose manipulation. This makes rapid, flexible text-guided editing practical while preserving fidelity to the input image and alignment with the user prompt.

Abstract

Conventional text-guided single-image editing approaches require a two-step process: fine-tuning the target text embedding for over 1K iterations, then fine-tuning the generative model for another 1.5K iterations. Although this ensures that the resulting image closely aligns with both the input image and the target text, the process often takes 7 minutes per image, which makes it impractical for many applications. To address this bottleneck, we introduce FastEdit, a fast text-guided single-image editing method with semantic-aware diffusion fine-tuning that accelerates the editing process to only 17 seconds. FastEdit streamlines the generative model's fine-tuning phase, reducing it from 1.5K to just 50 iterations. For diffusion fine-tuning, we choose the time step values based on the semantic discrepancy between the input image and the target text. Furthermore, FastEdit circumvents the initial embedding fine-tuning step by using an image-to-image model that conditions on the feature space rather than the text embedding space. This aligns the target text prompt and the input image within the same feature space and saves substantial processing time. Additionally, we apply the parameter-efficient fine-tuning technique LoRA to the U-Net. With LoRA, FastEdit reduces the model's trainable parameters to only 0.37% of the original size while achieving comparable editing outcomes with significantly less computational overhead. We conduct extensive experiments to validate the editing performance of our approach and show promising editing capabilities, including content addition, style transfer, background replacement, and posture manipulation.
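To make the LoRA claim in the abstract concrete, the following is a minimal sketch of how a low-rank update shrinks the trainable parameter count. It assumes the standard LoRA formulation, in which a frozen weight of shape (d_out, d_in) gets a trainable update B·A with B of shape (d_out, r) and A of shape (r, d_in). The layer shapes and rank used in the usage note are hypothetical and are not taken from the paper, so the resulting fraction is illustrative and will not reproduce the reported 0.37%.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in),
    so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def trainable_fraction(layer_shapes, rank, total_params):
    """Fraction of `total_params` that is trainable when LoRA of the
    given rank is attached to every (d_in, d_out) layer in `layer_shapes`.

    `layer_shapes` and `total_params` are inputs supplied by the caller;
    they are assumptions of this sketch, not values from the paper.
    """
    lora_params = sum(lora_param_count(d_in, d_out, rank)
                      for d_in, d_out in layer_shapes)
    return lora_params / total_params
```

For example, with four hypothetical 320×320 projection layers and rank 4, each layer adds 4 × (320 + 320) = 2,560 trainable parameters, against 102,400 frozen weights per layer, so `trainable_fraction([(320, 320)] * 4, 4, 4 * 320 * 320)` evaluates to 0.025. The fraction falls further when LoRA is attached to only a subset of a large model's layers.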

Paper Structure

This paper contains 17 sections, 5 equations, 9 figures, 1 table, and 1 algorithm.

Figures (9)

  • Figure 1: FastEdit – Text-Guided Single-Image Editing in 17 seconds. We show pairs of 512$\times$512 input images and the given target texts, with the corresponding edited results. Compared with the baseline methods, FastEdit fine-tunes only 0.37% of the parameters, for just 50 iterations. Because no embedding optimization is needed, the fine-tuned model supports arbitrary target texts.
  • Figure 2: Different target texts applied to the same images. FastEdit edits the same image differently based on the semantic discrepancy between the input image and the target text.
  • Figure 3: Illustration of FastEdit. Given an input image and a target text, we first project them into features using the CLIP model. Then, we calculate the semantic discrepancy between the two to determine the denoising time steps. Next, we fine-tune the low-rank matrices added to the diffusion model for a few iterations. Lastly, we interpolate the CLIP features to generate the desired images with the fine-tuned model.
  • Figure 4: Fine-tuning on a single denoising time step leads to a trade-off between texture and structure. Low time step values tend to preserve the texture details of the object but distort the structure of the input image. In contrast, high time step values tend to preserve the image structure but lose textural details.
  • Figure 5: Method comparison. We compare Imagic (Kawar et al., 2023), SVDiff (Han et al., 2023), and LoRA (Hu et al., 2021) with our method. FastEdit successfully applies the desired edit and preserves the original details.
  • ...and 4 more figures
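
The pipeline described in the Figure 3 caption (project to features, measure semantic discrepancy, pick denoising time steps, interpolate features) can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes the discrepancy is one minus the cosine similarity of the two feature vectors, and it assumes time steps are chosen by bucketing that discrepancy against caller-supplied thresholds. The function names, the thresholds, and the candidate time-step sets are all hypothetical, and plain Python lists stand in for real CLIP features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_discrepancy(image_feat, text_feat):
    """Assumed discrepancy measure: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(image_feat, text_feat)


def select_timestep_set(discrepancy, thresholds, candidate_sets):
    """Pick one candidate time-step set by bucketing the discrepancy.

    `thresholds` must be sorted ascending, and `candidate_sets` must have
    exactly one more entry than `thresholds`. Which set belongs to which
    bucket is left to the caller; this sketch does not encode the
    paper's actual rule.
    """
    for threshold, steps in zip(thresholds, candidate_sets):
        if discrepancy < threshold:
            return steps
    return candidate_sets[-1]


def interpolate_features(image_feat, text_feat, alpha):
    """Linear interpolation: alpha=0 gives the image feature,
    alpha=1 gives the text feature."""
    return [(1.0 - alpha) * x + alpha * y
            for x, y in zip(image_feat, text_feat)]
```

In use, the selected time-step set would drive the short LoRA fine-tuning run, and `interpolate_features` would produce the conditioning passed to the fine-tuned image-to-image model, with `alpha` controlling how strongly the edit follows the target text.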