
When Test-Time Guidance Is Enough: Fast Image and Video Editing with Diffusion Guidance

Ahmed Ghorbel, Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati

TL;DR

This work provides theoretical insights into the VJP-free approximation of Moufad et al. (2025) and substantially extends its empirical evaluation to large-scale image and video editing benchmarks, demonstrating that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.

Abstract

Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector--Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpass, training-based methods.

Paper Structure (17 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of the editing pipeline for video modalities. The input video and mask are lifted to the latent space for inpainting. A pre-trained, frozen diffusion model is used with a posterior sampler to guide the generation toward prompt-aligned reconstructions, which are then decoded back to pixel space.
  • Figure 2: Editing via inpainting using DInG with SD3 as prior. Given masked inputs, the model fills the missing regions according to diverse textual prompts. The runtime is limited to 10 seconds per image (1024px).
  • Figure 3: Visualization of context leakage in latent-space video inpainting. When lifting a pixel-space inpainting task to the latent space, downsampling the mask without adjustment can lead to boundary artifacts. From top to bottom row: input video, binary edit masks, reconstruction using naively downsampled masks, and reconstruction using dilated masks. Note that naive downsampling (third row) causes the t-shirt's original boundary (blue outline) to leak into the latent reconstruction, whereas dilation (fourth row) avoids this issue.
  • Figure 4: Qualitative comparison of DInG and SD3 with ControlNet (SD3 Inpaint) on HumanEdit. The methods are limited to a runtime of 30 seconds per image.
  • Figure 5: Qualitative comparison 1 of DInG, Flux with ControlNet (Flux Inpaint), and Flux Fill on HumanEdit. The methods are limited to a runtime of 30 seconds per image.
  • ...and 1 more figure
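
The context leakage described in the Figure 3 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the downsampling factor of 8, and the dilation radius are assumptions chosen for illustration, and the mask convention (True marks pixels to edit) is also assumed. The idea is that naive subsampling keeps a latent cell as "observed" even when part of its pixel block lies inside the edit region, so original content can leak through; marking a cell as editable whenever any of its pixels is masked, and optionally growing the result, is one conservative alternative.

```python
import numpy as np


def downsample_mask_naive(mask: np.ndarray, factor: int) -> np.ndarray:
    """Subsample the pixel mask by keeping the top-left pixel of each block."""
    return mask[::factor, ::factor].astype(bool)


def downsample_mask_dilated(mask: np.ndarray, factor: int, dilation: int = 0) -> np.ndarray:
    """Mark a latent cell as editable if ANY pixel in its block is masked,
    then grow the editable region by `dilation` latent cells (3x3 neighbourhood).

    Both `factor` and `dilation` are illustrative parameters, not values
    taken from the paper.
    """
    h, w = mask.shape
    if h % factor or w % factor:
        raise ValueError("mask size must be divisible by the downsampling factor")
    # Group pixels into (factor x factor) blocks and take the block-wise maximum.
    blocks = mask.reshape(h // factor, factor, w // factor, factor)
    latent = blocks.max(axis=(1, 3)).astype(bool)
    lh, lw = latent.shape
    for _ in range(dilation):
        padded = np.pad(latent, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(latent)
        # OR together the nine shifted copies: a binary dilation with a 3x3 square.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + lh, dx:dx + lw]
        latent = grown
    return latent


# Edit region that is not aligned with the 8x8 latent grid.
pixel_mask = np.zeros((16, 16), dtype=bool)
pixel_mask[3:13, 3:13] = True

naive = downsample_mask_naive(pixel_mask, factor=8)
safe = downsample_mask_dilated(pixel_mask, factor=8)
print(naive.astype(int))  # only one latent cell is marked editable
print(safe.astype(int))   # every cell touching the edit region is editable
```

In this toy case the naive mask treats three partially masked latent cells as observed context, which is the kind of boundary leakage the caption describes, while the block-wise rule regenerates all four. A larger `dilation` trades a wider regenerated border for a lower risk of leakage.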