Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models

Dvir Samuel; Barak Meiri; Haggai Maron; Yoad Tewel; Nir Darshan; Shai Avidan; Gal Chechik; Rami Ben-Ari

TL;DR

GNRI reframes diffusion inversion as scalable root-finding and introduces a guided Newton-Raphson procedure that uses a diffusion prior to steer iterations toward in-distribution latents. By solving a scalarized residual with a principled guidance term, GNRI converges quickly (often 1–2 NR steps per time step) and achieves high reconstruction and editing quality without requiring training or prompt optimization. The approach delivers real-time image editing on few-step diffusion models, and improves seed interpolation and rare-concept generation across multiple models (Stable Diffusion, SDXL-Turbo, Flux.1). These results offer a practical, model-agnostic inversion technique suitable for interactive applications and downstream editing tasks.

Abstract

Diffusion inversion is the problem of taking an image and a text prompt that describes it and finding a noise latent that would generate the exact same image. Most current deterministic inversion techniques operate by approximately solving an implicit equation and may converge slowly or yield poor reconstructed images. We formulate the problem as finding the roots of an implicit equation and develop a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. We show that a vanilla application of NR is computationally infeasible, while naively transforming it into a computationally tractable alternative tends to converge to out-of-distribution solutions, resulting in poor reconstruction and editing. We therefore derive an efficient guided formulation that converges quickly and provides high-quality reconstruction and editing. We showcase our method on real image editing with three popular open-source diffusion models: Stable Diffusion, SDXL-Turbo, and Flux, with different deterministic schedulers. Our solution, Guided Newton-Raphson Inversion (GNRI), inverts an image within 0.4 sec (on an A100 GPU) for few-step models (SDXL-Turbo and Flux.1), opening the door for interactive image editing. We further show improved results in image interpolation and generation of rare objects.
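
The root-finding view in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation. `toy_step` is a hypothetical stand-in for one scheduler step that depends on the unknown latent (a real model would call its noise-prediction network there), and the per-coordinate update with the 1/D factor is one assumed way of applying Newton-Raphson to a scalar function of many variables. The guidance term is omitted because its exact form is not given here, so this corresponds to the unguided, scalarized variant only.

```python
import math


def toy_step(z, z_prev):
    # Hypothetical implicit step f(z): the right-hand side depends on the
    # unknown z itself, so z = f(z) cannot be solved in closed form.
    return [0.9 * p + 0.3 * math.sin(x) for x, p in zip(z, z_prev)]


def scalar_residual_and_grad(z, z_prev):
    # Per-coordinate residual r_i = z_i - f_i(z), scalarized as R = sum_i |r_i|.
    f = toy_step(z, z_prev)
    r = [x - fx for x, fx in zip(z, f)]
    total = sum(abs(v) for v in r)
    # Analytic gradient for this toy step: dR/dz_i = sign(r_i) * (1 - 0.3 cos z_i).
    # sign(0) is taken as +1 so the gradient is never zero.
    grad = [
        (1.0 if v >= 0 else -1.0) * (1.0 - 0.3 * math.cos(x))
        for v, x in zip(r, z)
    ]
    return total, grad


def nr_invert_step(z_prev, max_iters=50, tol=1e-10):
    # Warm start from the previous latent, then iterate until R is below tol.
    z = list(z_prev)
    d = len(z)
    for k in range(max_iters):
        total, grad = scalar_residual_and_grad(z, z_prev)
        if total < tol:
            return z, k
        # Assumed update: each coordinate removes a 1/D share of the
        # linearized scalar residual, z_i <- z_i - R / (D * dR/dz_i).
        z = [x - total / (d * g) for x, g in zip(z, grad)]
    return z, max_iters


if __name__ == "__main__":
    z_prev = [0.5, -1.2]
    z, iters = nr_invert_step(z_prev)
    final_residual, _ = scalar_residual_and_grad(z, z_prev)
    print("iterations:", iters)
    print("final scalar residual:", final_residual)
```

The scalar residual needs only a gradient (a single vector the size of the latent) rather than a full Jacobian, which is what makes this kind of update tractable. In a real model that gradient would typically come from automatic differentiation rather than a closed-form derivative.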

Paper Structure (25 sections, 19 equations, 15 figures, 6 tables)

Introduction
Related work
Preliminaries
Our method: Guided Newton Raphson Inversion
Framing inversion as efficient root-finding
Guided Newton Raphson Inversion
Experiments
Guidance Term Motivation and Comparison to Other Numerical Schemes
Image Reconstruction
Real-Time Image Editing
Seed Interpolation and Rare Concept Generation
Summary
Newton method for Multivariable Scalar Function
Discussion on the Newton-Raphson method for functions without zero crossings
GNRI Scheme Contraction Mapping
...and 10 more sections

Figures (15)

Figure 1: Consecutive real image inversions and editing using our GNRI with Flux.1-schnell (0.4 sec on an A100 GPU).
Figure 2: Newton-Raphson Inversion iterates over the implicit function (Eq. \ref{eq:ddim_implicit}) using the scheme in Eq. \ref{eq:nr_scheme}, at every time-step in the inversion path. Initialized with $z^0_t = z_{t-1}$, it converges to $z_t$ within $\approx 2$ iterations. Each box denotes one inversion step; black circles correspond to intermediate latents in the denoising process; green circles correspond to intermediate Newton-Raphson iterations.
Figure 3: The effect of the GNRI guiding term on NR inversion, and comparison to other iterative inversion methods. All results are averages computed with SDXL-Turbo applied to 5,000 COCO images. (a) Average residuals throughout optimization. NR-based methods are the fastest to converge. Gradient descent was run with the largest learning rate that was stable but was still the slowest. (b) Reconstruction quality (PSNR). Adding guidance (blue) to NRI (green) significantly improves the quality of the converged solution. (c) Likelihood of inferred noise. Without the guiding term, NRI (green) finds solutions that are substantially different from those found by other methods, which explains the low reconstruction quality.
Figure 4: (Left) Reconstruction qualitative results: Comparing image inversion-reconstruction performance. While all baseline methods struggle to preserve the original image, GNRI reconstructs it accurately. (Right) Inversion Results: Mean reconstruction quality (y-axis, PSNR) and runtime (x-axis, seconds) on the COCO2017 validation set. Our method achieves high PSNR while reducing inversion-reconstruction time by a factor of $\times 2$ (compared to DDIM) and up to $\times 40$ (compared to ExactDPM) on SDXL-Turbo, and by $\times 10$ to $\times 40$ on Flux.1 compared to other approaches.
Figure 5: (Left) Qualitative results of image editing. GNRI edits images more naturally while preserving the structure of the original image. All baselines were executed until they reached convergence. (Right) Evaluation of editing performance: GNRI achieves superior CLIP and LPIPS scores, indicating better compliance with text prompts and higher structure preservation.
...and 10 more figures
