TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models
Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, Daniel Cohen-Or
TL;DR
TurboEdit addresses fast, text-based image editing with few-step diffusion models. It diagnoses two failure modes of edit-friendly DDPM inversion in this setting, visual artifacts and weak editing strength, and proposes a fix for each: a time-shifted denoising schedule that aligns the statistics of the inverted noise with the expected noise schedule, and a pseudo-guidance term that amplifies prompt-driven edits. It also shows that edit-friendly inversion (EF) and Delta-Denoising Score (DDS) are equivalent under certain timesteps and learning rates, which enables faster, unified denoising with fewer evaluations. The approach edits real images in as few as $3$ diffusion steps on SDXL-Turbo, achieving sub-second editing with speedups of $5\times$ to $500\times$ while maintaining competitive quality and prompt alignment. The result is a practical option for interactive image editing, along with insight into the mechanisms of few-step diffusion editing. The work also suggests avenues for further improvement in noise-schedule alignment and geometric editing in fast-sampling diffusion models.
Abstract
Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the backward diffusion process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between the inverted noises and the expected noise schedule, and suggest a shifted noise schedule that corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.
