Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models

Dvir Samuel, Barak Meiri, Haggai Maron, Yoad Tewel, Nir Darshan, Shai Avidan, Gal Chechik, Rami Ben-Ari

TL;DR

Guided Newton-Raphson Inversion (GNRI) reframes diffusion inversion as a scalable root-finding problem and introduces a guided Newton-Raphson (NR) procedure that uses a diffusion prior to steer iterations toward in-distribution latents. By solving a scalarized residual with a principled guidance term, GNRI converges quickly (often 1–2 NR steps per time step) and achieves high reconstruction and editing quality without training or prompt optimization. The approach enables real-time image editing on few-step diffusion models and improves seed interpolation and rare-concept generation across multiple models (Stable Diffusion, SDXL-Turbo, Flux.1). These results offer a practical, model-agnostic inversion technique suitable for interactive applications and downstream editing tasks.
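As a rough reading of "scalarized residual with a guidance term", the idea can be sketched as follows. The symbols $f$, $r$, $F$, $G$, $\lambda$, and $D$ are notation assumed here for illustration; the choice of norm, the exact form of the prior term $G$, and the per-coordinate update are assumptions that this summary does not spell out.

```latex
% Implicit inversion equation at time step t, written as a root-finding problem:
r(z_t) \;=\; z_t - f(z_t) \;=\; 0
% Scalarized objective: residual magnitude plus a weighted prior (guidance) term:
F(z_t) \;=\; \lVert r(z_t) \rVert_1 \;+\; \lambda\, G(z_t)
% One possible Newton-Raphson-style update on the scalar F, for each coordinate i of a D-dimensional latent:
z_t^{k+1}(i) \;=\; z_t^{k}(i) \;-\; \frac{1}{D}\,\frac{F(z_t^{k})}{\partial F(z_t^{k}) / \partial z_t(i)}
```

Under this reading, each iteration needs only the gradient of a scalar rather than a full $D \times D$ Jacobian, and $G$ penalizes latents that are unlikely under the diffusion prior.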

Abstract

Diffusion inversion is the problem of taking an image and a text prompt that describes it and finding a noise latent that would generate the exact same image. Most current deterministic inversion techniques operate by approximately solving an implicit equation and may converge slowly or yield poorly reconstructed images. We formulate the problem as finding the roots of an implicit equation and develop a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. We show that a vanilla application of NR is computationally infeasible, while naively transforming it into a computationally tractable alternative tends to converge to out-of-distribution solutions, resulting in poor reconstruction and editing. We therefore derive an efficient guided formulation that converges quickly and provides high-quality reconstructions and editing. We showcase our method on real image editing with three popular open-source diffusion models: Stable Diffusion, SDXL-Turbo, and Flux with different deterministic schedulers. Our solution, Guided Newton-Raphson Inversion, inverts an image within 0.4 sec (on an A100 GPU) for few-step models (SDXL-Turbo and Flux.1), opening the door for interactive image editing. We further show improved results in image interpolation and generation of rare objects.
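To make the root-finding view concrete, here is a minimal toy sketch in Python, not the authors' implementation. It assumes a hypothetical elementwise noise predictor (`toy_eps`) and made-up coefficients `a` and `b` standing in for scheduler terms, so the Jacobian is diagonal and plain Newton-Raphson is cheap. It omits the guidance term entirely and only compares Newton-Raphson with fixed-point iteration on the same implicit equation, both warm-started from the previous latent.

```python
import numpy as np

def toy_eps(z):
    # Hypothetical stand-in for a noise predictor eps_theta(z_t, t, p).
    # It acts elementwise, so the residual's Jacobian is diagonal.
    return np.tanh(z)

def toy_eps_grad(z):
    # Elementwise derivative of toy_eps.
    return 1.0 - np.tanh(z) ** 2

def residual(z_t, z_prev, a, b):
    # r(z_t) = z_t - f(z_t), with f(z_t) = a * z_prev + b * eps(z_t).
    # A root of r is a latent consistent with the implicit inversion step.
    return z_t - (a * z_prev + b * toy_eps(z_t))

def newton_solve(z_prev, a, b, tol=1e-8, max_iters=50):
    # Newton-Raphson on r, warm-started at the previous latent.
    z = z_prev.copy()
    for k in range(max_iters):
        r = residual(z, z_prev, a, b)
        if np.max(np.abs(r)) < tol:
            return z, k
        # Diagonal Jacobian: dr/dz = 1 - b * eps'(z), applied elementwise.
        z = z - r / (1.0 - b * toy_eps_grad(z))
    return z, max_iters

def fixed_point_solve(z_prev, a, b, tol=1e-8, max_iters=200):
    # Baseline: repeatedly substitute z <- f(z) from the same warm start.
    z = z_prev.copy()
    for k in range(max_iters):
        if np.max(np.abs(residual(z, z_prev, a, b))) < tol:
            return z, k
        z = a * z_prev + b * toy_eps(z)
    return z, max_iters

if __name__ == "__main__":
    z_prev = np.array([-1.5, -0.2, 0.4, 2.0])  # toy "latent" from the previous step
    a, b = 1.02, 0.3                           # made-up scheduler-like coefficients
    z_nr, k_nr = newton_solve(z_prev, a, b)
    z_fp, k_fp = fixed_point_solve(z_prev, a, b)
    print("Newton-Raphson iterations:", k_nr)
    print("Fixed-point iterations:   ", k_fp)
```

In a real diffusion model the noise predictor couples all latent dimensions, so the full Jacobian is too large to form or invert. That is the infeasibility the abstract refers to, and it is what motivates a scalarized, guided formulation.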

Paper Structure

This paper contains 25 sections, 19 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Consecutive real-image inversions and edits using GNRI with Flux.1-schnell (0.4 sec on an A100 GPU).
  • Figure 2: Newton-Raphson Inversion iterates on the implicit DDIM inversion equation using the Newton-Raphson update scheme at every time step in the inversion path. Initialized with $z^0_t = z_{t-1}$, it converges to $z_t$ within $\approx 2$ iterations. Each box denotes one inversion step; black circles correspond to intermediate latents in the denoising process; green circles correspond to intermediate Newton-Raphson iterations.
  • Figure 3: The effect of the GNRI guiding term on NR inversion, and a comparison with other iterative inversion methods. All results are averages computed with SDXL-Turbo applied to 5,000 COCO images. (a) Average residuals throughout optimization. NR-based methods are the fastest to converge. Gradient descent was run with the largest stable learning rate but is still the slowest. (b) Reconstruction quality (PSNR). Adding guidance (blue) to NRI (green) significantly improves the quality of the converged solution. (c) Likelihood of the inferred noise. Without the guiding term, NRI (green) finds solutions that are substantially different from those found by other methods, which explains its low reconstruction quality.
  • Figure 4: (Left) Qualitative reconstruction results: comparison of image inversion-reconstruction performance. While all baseline methods struggle to preserve the original image, GNRI reconstructs it accurately. (Right) Inversion results: mean reconstruction quality (y-axis, PSNR) and runtime (x-axis, seconds) on the COCO2017 validation set. Our method achieves high PSNR while reducing inversion-reconstruction time by a factor of $2\times$ (compared to DDIM) and up to $40\times$ (compared to ExactDPM) on SDXL-Turbo, and by $10\times$ to $40\times$ on Flux.1, compared to other approaches.
  • Figure 5: (Left) Qualitative results of image editing. GNRI edits images more naturally while preserving the structure of the original image. All baselines were executed until they reached convergence. (Right) Evaluation of editing performance: GNRI achieves superior CLIP and LPIPS scores, indicating better compliance with text prompts and higher structure preservation.
  • ...and 10 more figures
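Several captions above report reconstruction quality as PSNR. As a reminder of what that metric measures, here is a minimal sketch of the standard definition. The function name `psnr` and the assumption that pixel values lie in $[0, 1]$ (`max_value=1.0`) are illustrative choices; the paper's exact evaluation conventions (value range, color space, resolution) are not stated in this summary.

```python
import numpy as np

def psnr(original, reconstructed, max_value=1.0):
    # Peak signal-to-noise ratio in dB between two images of equal shape.
    # max_value is the largest possible pixel value (assumed 1.0 here).
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

if __name__ == "__main__":
    image = np.zeros((4, 4))
    noisy = image + 0.1  # uniform error of 0.1 on every pixel
    print("PSNR (dB):", psnr(image, noisy))
```

Higher PSNR means the reconstructed image is closer to the original in mean squared error, which is why it serves as the y-axis when inversion quality is plotted against runtime.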