Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models

Ali Subhan, Ashir Raza

TL;DR

DragDiffusion enables interactive point-based image editing by optimizing a latent at an intermediate timestep $t$ with motion supervision on UNet features, combined with identity-preserving LoRA fine-tuning and spatial mask regularization. This reproducibility study re-runs the main ablations on diffusion timestep, LoRA strength, mask weight, and UNet feature level using the authors’ code and the DragBench benchmark, and also tests a multi-timestep latent optimization extension. The results largely corroborate the original claims: intermediate timesteps yield the best balance of spatial control and image fidelity, LoRA fine-tuning is essential, and mid-level UNet features provide the best guidance, while multi-timestep optimization increases cost without improving performance. The study also highlights sensitivity to a small set of hyperparameters and documents environment dependencies that affect reproducibility, clarifying both how robust DragDiffusion is and what to watch for when applying it.

Abstract

DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors' released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion-TMLR-Reproducibility-Challenge.

Paper Structure (40 sections, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Overview of the DragDiffusion pipeline reproduced in this study. Given an input image, prompt, mask, and handle--target point pairs, the image is inverted using DDIM. A single latent at timestep $t$ is optimized via motion supervision and mask regularization. The optimized latent is then used during forward DDIM denoising with attention control to generate the edited image. Identity-preserving LoRA weights are optionally applied during generation.
  • Figure 2: Mean Distance as a function of the number of LoRA fine-tuning steps. Performance improves rapidly with early fine-tuning and peaks at 100 steps, while gains beyond 80 steps are modest. The qualitative trend matches Fig. 7(b) in Shi_2024_CVPR.
  • Figure 3: Qualitative effect of the number of LoRA fine-tuning steps on drag-based editing. All results correspond to the same input image and drag instruction. Increasing the number of LoRA steps improves edit stability and identity preservation, with minimal visible differences beyond 80--100 steps.
  • Figure 4: Effect of the UNet decoder block used for motion supervision. Mean Distance (left) and Image Fidelity (right) illustrate the trade-off between spatial accuracy and appearance preservation across feature levels. The reproduced ordering of decoder blocks exactly matches the original findings (Fig. 7(c), Shi_2024_CVPR), with mid-level features achieving the best spatial accuracy.
  • Figure 5: Qualitative comparison of mask regularization strength $\lambda$. Without regularization ($\lambda=0$), background distortions are visible due to unconstrained latent updates. Moderate regularization ($\lambda=0.1$) provides the best balance between accurate point manipulation and preservation of non-edited regions. Stronger regularization ($\lambda \geq 0.5$) increasingly restricts motion, leading to ineffective edits despite improved image fidelity.
  • ...and 3 more figures
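
The role of the mask regularization weight $\lambda$ described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the L1 form of the penalty, the convention that mask value 1 marks the editable region, the 2D list representation of the latent, and the function names `mask_regularizer` and `total_loss` are all assumptions made for illustration. The motion supervision term is passed in as a plain number rather than computed from UNet features.

```python
def mask_regularizer(z, z0, mask):
    """Sum of absolute latent changes outside the editable region.

    z, z0 and mask are 2D lists of equal shape (hypothetical toy latents).
    Assumed convention: mask == 1 is editable, mask == 0 must be preserved.
    """
    total = 0.0
    for row_z, row_z0, row_m in zip(z, z0, mask):
        for v, v0, m in zip(row_z, row_z0, row_m):
            total += abs(v - v0) * (1.0 - m)
    return total


def total_loss(motion_loss, z, z0, mask, lam):
    """Motion supervision term plus lambda-weighted mask regularization."""
    return motion_loss + lam * mask_regularizer(z, z0, mask)


# Toy example: a 4x4 latent where every entry moved by 1,
# and only the left two columns are editable.
z0 = [[0.0] * 4 for _ in range(4)]
z = [[1.0] * 4 for _ in range(4)]
mask = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]

reg = mask_regularizer(z, z0, mask)            # 8 protected entries changed by 1
loss_no_reg = total_loss(2.0, z, z0, mask, lam=0.0)
loss_moderate = total_loss(2.0, z, z0, mask, lam=0.1)
```

With $\lambda=0$ the changes outside the mask cost nothing, which is consistent with the background distortions noted in the caption. Larger $\lambda$ makes the penalty dominate the motion term, which is consistent with the restricted motion reported for $\lambda \geq 0.5$.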