Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance

Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati

TL;DR

This work tackles zero-shot inpainting with pretrained diffusion priors by introducing Decoupled INpainting Guidance (DInG), a method that needs no vector-Jacobian products (VJPs) and yields exact Gaussian posterior transitions via decoupled likelihood surrogates in latent space. By evaluating the likelihood on an independent proxy and leveraging Gaussian conjugacy, DInG avoids backpropagation through the denoiser while maintaining high fidelity to observed regions and realistic completions, especially under low number-of-function-evaluations (NFE) budgets. Across FFHQ, DIV2K, and PIE-Bench, DInG outperforms state-of-the-art zero-shot baselines and even surpasses a fine-tuned SD3 model on editing tasks, demonstrating strong observation consistency with efficient inference. The approach offers a practical, memory-efficient path to high-quality zero-shot inpainting with latent diffusion models, with implications for real-time image editing and restoration.

Abstract

Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields Gaussian posterior transitions that are simple and efficient to sample, sidestepping backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost. Code is available at https://github.com/YazidJanati/ding.
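
To make the phrase "Gaussian posterior transitions" concrete, the sketch below shows the textbook conjugate update that such transitions build on. It is not the DInG surrogate itself. It assumes an isotropic Gaussian prior over a flattened latent, a coordinate-wise mask, and additive Gaussian observation noise; the function names `masked_gaussian_posterior` and `sample_gaussian` and all parameters are hypothetical and chosen for illustration only.

```python
import random


def masked_gaussian_posterior(prior_mean, prior_var, y, mask, noise_var):
    """Elementwise conjugate update (illustrative, not the paper's surrogate).

    Prior:       x_i ~ N(prior_mean[i], prior_var)
    Likelihood:  y_i = x_i + N(0, noise_var) wherever mask[i] is truthy.
    Returns per-coordinate posterior means and variances.
    """
    post_mean, post_var = [], []
    for m, obs, observed in zip(prior_mean, y, mask):
        if observed:
            # Closed-form Gaussian-Gaussian update: no gradients needed.
            gain = prior_var / (prior_var + noise_var)
            post_mean.append(m + gain * (obs - m))
            post_var.append(prior_var * noise_var / (prior_var + noise_var))
        else:
            # Unobserved coordinates keep the prior unchanged.
            post_mean.append(m)
            post_var.append(prior_var)
    return post_mean, post_var


def sample_gaussian(mean, var, rng):
    """Draw one sample from the diagonal Gaussian N(mean, diag(var))."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]


if __name__ == "__main__":
    mean, var = masked_gaussian_posterior(
        prior_mean=[0.0, 0.0, 0.0, 0.0],
        prior_var=1.0,
        y=[2.0, -1.0, 0.0, 0.0],
        mask=[1, 1, 0, 0],  # first two coordinates are observed
        noise_var=0.25,
    )
    print(mean, var)
    print(sample_gaussian(mean, var, random.Random(0)))
```

Because the update is closed-form, sampling needs only elementwise arithmetic. This is the general reason a Gaussian surrogate can avoid differentiating through a network, under the assumptions stated above.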

Paper Structure

This paper contains 49 sections, 2 theorems, 51 equations, 11 figures, 8 tables, 4 algorithms.

Key Result

Proposition 1

Both $\hat{\pi}^{\mathsf{dps}}_{s \mid t}(\cdot \mid \mathbf{x}_t, \mathbf{y})$ and $\hat{\pi}^{\mathsf{ding}}_{s \mid t}(\cdot \mid \mathbf{x}_t, \mathbf{y})$ are Gaussian distributions with mean and covariance respectively $\bm{\mu}^{\mathsf{dps}}_{s \mid \ldots}$ [...] and as $\eta_s \rightarrow 0$.

Figures (11)

  • Figure 1: Zero-shot inpainting edits generated by DInG (50 NFEs) for different masking patterns using Stable Diffusion 3.5 (medium). Given masked inputs (left column), the model fills the missing regions according to diverse textual prompts.
  • Figure 2: Examples of reconstructions on FFHQ and DIV2K with 50 NFEs.
  • Figure 3: Latent-space masking and its correspondence to pixel space using a central square mask. The encoder and decoder of Stable Diffusion 3.5 (medium) were used. The first row shows latent images alongside the encoded mask applied to each, while the second row shows their decoded counterparts. Notice that the masked regions in the latent space translate directly to analogous masked regions in pixel space. For the sake of visualization, since the latent images have 16 channels, we apply PCA and visualize the first 3 components.
  • Figure 4: Performance of DInG on DIV2K under varying NFE budgets (20 to 500) across different masking patterns. Runtimes are measured on an H100 GPU.
  • Figure 5: Effect of prompt precision on inpainting quality.
  • ...and 6 more figures
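
The latent-space masking described in the Figure 3 caption can be illustrated with a small sketch that maps a pixel-space mask to latent resolution. This is an assumption-laden illustration, not the paper's procedure: it assumes the autoencoder downsamples each spatial dimension by a fixed factor (8 is used as a placeholder) and marks a latent cell as observed only when every pixel in its block is observed. The helper names `central_square_mask` and `downsample_mask` are hypothetical.

```python
def central_square_mask(size, hole):
    """Square mask of side `size`: 1 = observed pixel, 0 = masked central square."""
    start = (size - hole) // 2
    end = start + hole
    return [
        [0 if start <= r < end and start <= c < end else 1 for c in range(size)]
        for r in range(size)
    ]


def downsample_mask(pixel_mask, factor=8):
    """Map a pixel mask to latent resolution (factor is an assumed value).

    A latent cell is observed (1) only if all pixels in its
    factor x factor block are observed; otherwise it is masked (0).
    """
    height, width = len(pixel_mask), len(pixel_mask[0])
    if height % factor or width % factor:
        raise ValueError("mask size must be divisible by factor")
    latent_mask = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block_observed = all(
                pixel_mask[i + di][j + dj]
                for di in range(factor)
                for dj in range(factor)
            )
            row.append(1 if block_observed else 0)
        latent_mask.append(row)
    return latent_mask


if __name__ == "__main__":
    for row in downsample_mask(central_square_mask(32, 16), factor=8):
        print(row)
```

The conservative "all pixels observed" rule is one possible design choice: it never treats a partially masked block as known, at the cost of slightly enlarging the masked region in latent space.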

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2: Upper bound on $\varepsilon_s$
  • proof