TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, Daniel Cohen-Or

TL;DR

TurboEdit tackles the challenge of fast, text-based image editing using few-step diffusion models by diagnosing artifacts and weak editing strength in edit-friendly DDPM inversion and proposing two key fixes: a time-shifted denoising schedule to align inverted-noise statistics, and a pseudo-guidance term to amplify prompt-driven edits. It further reveals a deep equivalence between edit-friendly inversion (EF) and Delta-Denoising Score (DDS) under certain timesteps and learning rates, enabling faster, unified denoising with fewer evaluations. The approach enables real-image editing in as few as $3$ diffusion steps on SDXL-Turbo, achieving sub-second editing with speedups of $\times5$ to $\times500$ while maintaining competitive quality and prompt alignment. These contributions offer a practical, scalable solution for interactive image editing and provide deeper insights into the mechanisms of few-step diffusion editing. The work suggests avenues for further improvements in noise schedule alignment and geometric editing capabilities in fast-sampling diffusion models.

Abstract

Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the backward diffusion process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between inverted noises and the expected noise schedule, and suggest a shifted noise schedule which corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.

Paper Structure (19 sections, 26 equations, 12 figures, 2 tables)


Figures (12)

  • Figure 1: Comparison of the pixel-wise standard deviations of inverted noise maps and the expected distribution. The scale of corrections predicted by standard edit-friendly DDPM inversion (red, Eq. \ref{eq:edit_friendly_corrections}) is consistently higher than the expected noise schedule (green). The higher values approximately align with a shift along the x-axis: i.e., edit-friendly noise scales align with earlier steps in the diffusion process. We thus propose a time-shifted inversion schedule, where the image is cleaned as if it belonged to a time-point aligning with its noise scale, rather than the real step. In practice, shifting the schedule by a constant $200$ steps provides good alignment (blue) and resolves most artifacts. To correct the statistics of the last step, we further apply norm-clipping to the predicted noise at that stage (purple). Shaded regions indicate the 68% confidence interval.
  • Figure 2: Effect of increasing the strength of the cross-prompt term ($w_p$) and the cross-trajectory term ($w_t$) in DDPM inversion. Both terms can strengthen the influence of the text condition on the edited image, but increasing the cross-trajectory term introduces artifacts and saturation.
  • Figure 3: Qualitative editing results of our method. All results use $4$ diffusion steps.
  • Figure 4: Comparisons against multi-step editing methods. Our results are on par with existing baselines, while being $5\times$ to $300\times$ faster.
  • Figure 5: Comparisons with few-step methods. Our method can better preserve the content of the original image, while applying meaningful edits.
  • ...and 7 more figures
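
The quantities named in the captions of Figures 1 and 2 can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The helper names (`shift_timestep`, `clip_norm`, `combine_terms`), the 1000-step training schedule, the sign convention of the shift, and the exact split into cross-prompt and cross-trajectory terms are all assumptions made for illustration. The captions only state that the schedule is shifted by a constant 200 steps, that the last-step noise is norm-clipped, and that the two terms are weighted by $w_p$ and $w_t$.

```python
import math

def shift_timestep(t, shift=200, num_train_steps=1000):
    """Timestep reported to the denoiser after a constant shift.

    Assumption: the sign of `shift` depends on how timesteps are indexed;
    the result is clamped to the valid range [0, num_train_steps - 1].
    """
    return max(0, min(num_train_steps - 1, t + shift))

def clip_norm(values, max_norm):
    """Rescale a flat list of floats so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0 or norm <= max_norm:
        return list(values)
    scale = max_norm / norm
    return [v * scale for v in values]

def combine_terms(eps_tgt_edit, eps_src_edit, eps_src_orig, w_p=1.0, w_t=1.0):
    """Weighted sum of a cross-prompt and a cross-trajectory difference.

    Assumed decomposition (hypothetical naming):
      cross-prompt     = eps(edited latent, target prompt) - eps(edited latent, source prompt)
      cross-trajectory = eps(edited latent, source prompt) - eps(original latent, source prompt)
    With w_p = w_t = 1 the sum collapses to eps_tgt_edit - eps_src_orig.
    """
    return [
        w_p * (a - b) + w_t * (b - c)
        for a, b, c in zip(eps_tgt_edit, eps_src_edit, eps_src_orig)
    ]
```

Under these assumptions, raising `w_p` above 1 amplifies only the prompt-driven difference, while `w_t` scales the difference caused by the two latents diverging, which is the term Figure 2 associates with artifacts and saturation.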