FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-Tuning
Zhi Chen, Zecheng Zhao, Yadan Luo, Zi Huang
TL;DR
FastEdit is a fast, text-guided single-image editing framework. It bypasses text-embedding optimization by using an image-to-image diffusion model conditioned on CLIP-derived image features and target text features. A semantic-aware diffusion fine-tuning strategy selects the set of time steps according to the semantic discrepancy between the input image and the target text, which cuts fine-tuning to 50 iterations, while LoRA reduces the trainable parameters to 0.37% of the original model. Experiments on diverse image-text pairs and on TedBench show competitive or superior editing quality at much lower latency (about 17 seconds per image, versus about 7 minutes for prior methods). Supported edits include content addition, style transfer, background replacement, and pose manipulation, making the approach practical for rapid text-guided editing that stays faithful to the input image and aligned with the user's prompt.
Abstract
Conventional text-guided single-image editing approaches require a two-step process: fine-tuning the target text embedding for over 1K iterations, then fine-tuning the generative model for another 1.5K iterations. Although this ensures that the resulting image closely aligns with both the input image and the target text, the process often takes 7 minutes per image, which makes it too slow for practical use. To address this bottleneck, we introduce FastEdit, a fast text-guided single-image editing method with semantic-aware diffusion fine-tuning that shortens the editing process to only 17 seconds. FastEdit streamlines the generative model's fine-tuning phase, reducing it from 1.5K to just 50 iterations. During diffusion fine-tuning, we select the time-step values according to the semantic discrepancy between the input image and the target text. Furthermore, FastEdit skips the initial text-embedding fine-tuning step by using an image-to-image model that conditions on the feature space rather than the text-embedding space. This aligns the target text prompt and the input image within the same feature space and saves substantial processing time. Additionally, we apply the parameter-efficient fine-tuning technique LoRA to the U-Net. With LoRA, FastEdit reduces the model's trainable parameters to only 0.37% of the original model, achieving comparable editing outcomes with significantly less computational overhead. We conduct extensive experiments to validate the editing performance of our approach and show promising editing capabilities, including content addition, style transfer, background replacement, and posture manipulation.
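To make the LoRA step concrete, the following is a minimal sketch of a low-rank adapter on a single linear layer, written in plain Python. It is not the FastEdit implementation: the class name `LoRALinear`, the helper `lora_fraction`, the rank, the scaling factor `alpha`, and the layer sizes are all illustrative assumptions. The abstract does not state which U-Net layers receive adapters or what rank is used, so the sketch only shows why the trainable share becomes small, not how the 0.37% figure is obtained.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update B @ A.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = weight                  # frozen, shape d_out x d_in
        self.d_out = len(weight)
        self.d_in = len(weight[0])
        self.rank = rank
        self.scale = alpha / rank
        # A gets a small random init, B starts at zero, so the adapted
        # layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)               # W x (frozen path)
        delta = matvec(self.B, matvec(self.A, x))   # B (A x) (trainable path)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A (rank x d_in) and B (d_out x rank) are updated.
        return self.rank * (self.d_in + self.d_out)

    def frozen_params(self):
        return self.d_out * self.d_in


def lora_fraction(d_in, d_out, rank):
    """Trainable adapter parameters relative to the frozen weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
    print(layer.forward([1.0, 1.0]))        # same as the frozen layer at init
    print(lora_fraction(320, 320, rank=4))  # share for one assumed 320x320 layer
```

For a single layer, the trainable share is r(d_in + d_out) / (d_in * d_out), which shrinks as the layer grows relative to the rank. The share for a whole model is lower still when adapters are attached to only some layers, since all remaining weights stay frozen.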
