GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models

Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu

TL;DR

GoodDrag addresses instability in diffusion-based drag editing by introducing Alternating Drag and Denoising (AlDD) and Information-Preserving Motion Supervision (IP-MS). It couples these techniques with a new Drag100 benchmark and evaluation metrics (DAI and Gemini Score) to quantify drag accuracy and perceptual quality. Empirical results show GoodDrag outperforms state-of-the-art methods on both fidelity and precise point manipulation, while maintaining practical runtime and memory usage. The work establishes a strong baseline for diffusion-based drag editing and paves the way for broader application, including potential extensions to video editing.

Abstract

In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The project page is https://gooddrag.github.io.
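
The alternation described above can be sketched as a schedule of operations. This is a minimal illustration, not the authors' implementation: the function names and the string labels are hypothetical, and it assumes (following the Figure 3 caption) that one denoising step follows every B drag steps, that K is divisible by B, and that any remaining denoising steps run after the last drag.

```python
def baseline_schedule(K, T):
    """All K drag steps at a single time step, followed by T denoising steps."""
    return ["drag"] * K + ["denoise"] * T


def aldd_schedule(K, B, T):
    """One denoising step after every B drag steps, then the remaining denoising steps.

    K: total number of drag steps (must be divisible by B)
    B: number of drag steps per denoising step
    T: total number of denoising steps (assumed to be at least K // B)
    """
    if B <= 0 or K % B != 0:
        raise ValueError("K must be a positive multiple of B")
    interleaved = K // B  # denoising steps consumed while dragging
    if interleaved > T:
        raise ValueError("not enough denoising steps to interleave")
    ops = []
    for _ in range(interleaved):
        ops.extend(["drag"] * B)
        ops.append("denoise")
    ops.extend(["denoise"] * (T - interleaved))
    return ops


def longest_drag_run(ops):
    """Length of the longest run of consecutive drag steps without denoising."""
    longest = current = 0
    for op in ops:
        current = current + 1 if op == "drag" else 0
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    print(baseline_schedule(K=4, T=3))
    print(aldd_schedule(K=4, B=2, T=3))
```

Both schedules contain the same number of drag and denoising steps; the difference is that the alternating schedule never applies more than B drag steps before a denoising step, which is the property the abstract credits for the improved fidelity.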

Paper Structure

This paper contains 17 sections, 12 equations, 14 figures, 3 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Existing diffusion-based drag editing methods (dotted trajectory) typically perform all drag operations at once and then apply denoising steps to correct the resulting perturbations. However, the accumulated perturbations are often too large to be corrected with high fidelity. In contrast, the proposed AlDD framework (solid trajectory) alternates between drag and denoising operations within the diffusion process, which prevents large perturbations from accumulating and yields more accurate edits. The drag operation modifies the image to achieve the desired dragging effect but introduces perturbations that push the intermediate result away from the natural image manifold. The denoising operation, on the other hand, is trained to estimate the score function of the natural image distribution and guides intermediate results back toward the image manifold.
  • Figure 2: Given an input image (Original) and user-specified control points (User Edit), our proposed GoodDrag effectively "drags" the semantic contents from the initial handle point to the target point, as indicated by the white arrow. The blue point is the target point, fixed throughout the pipeline, while the red point represents the handle point moving closer to the target point during the optimization of GoodDrag. Optionally, users can select an indication mask to specify the editable region as shown in the User Edit column.
  • Figure 3: Overview of the proposed AlDD framework. (a) Existing methods first perform all drag editing operations $\{g_k\}_{k=1}^K$ at a single time step $T$ and subsequently apply all denoising operations $\{f_t\}_{t=T}^1$ to transform the edited image $z_T^K$ into the VAE image space. (b) To mitigate the accumulated perturbations in (a), AlDD alternates between the drag operation $g$ and the diffusion denoising operation $f$, which leads to higher quality results. Specifically, we apply one denoising operation after every $B$ drag steps and ensure the total number of drag steps $K$ is divisible by $B$. We set $B=2$ in this figure for clarity.
  • Figure 4: We generate 10 random noise samples from the distribution $\mathcal{N}(0,0.1^2\mathbf{I})$ and compare two scenarios: (b) adding all samples simultaneously to $z_T$ and (c) adding each sample individually across 10 different time steps. In the former case, where all noise samples are added to $z_T$ at once, the resulting image is significantly degraded. In contrast, when the noise samples are distributed across multiple time steps, the resulting image preserves the original content with high fidelity.
  • Figure 5: Illustration of the feature drifting issue. In (d), the initial handle points are located near the boundary of the beach wave. As drag editing progresses, the features of the handle points deviate from their original appearance. We show the intermediate result at the 90th motion supervision (MS) step in (e), where the handle points have drifted away from the wave boundary, leading to artifacts and inaccurate point movement in (b). To alleviate this issue, we propose information-preserving motion supervision (IP) to preserve the fidelity of the handle points to the original points as shown in (f), which effectively facilitates higher-quality results in (c).
  • ...and 9 more figures
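
As a rough illustration of the Figure 4 setup, the sketch below computes how large the perturbation becomes when 10 independent samples from $\mathcal{N}(0,0.1^2)$ are summed and added at once, compared with a single sample. It covers only the perturbation magnitude per coordinate; it does not model the denoising steps or the image itself, and the function names are hypothetical.

```python
import math
import random
import statistics

SIGMA = 0.1        # per-sample standard deviation, as in the Figure 4 caption
NUM_SAMPLES = 10   # number of noise samples, as in the Figure 4 caption


def accumulated_std(sigma, n):
    """Standard deviation of the sum of n independent N(0, sigma^2) samples."""
    return sigma * math.sqrt(n)


def empirical_accumulated_std(sigma, n, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity for one coordinate."""
    rng = random.Random(seed)
    totals = [sum(rng.gauss(0.0, sigma) for _ in range(n)) for _ in range(trials)]
    return statistics.pstdev(totals)


if __name__ == "__main__":
    one_shot = accumulated_std(SIGMA, NUM_SAMPLES)  # all samples added at once
    per_step = accumulated_std(SIGMA, 1)            # one sample per time step
    print(f"all at once: {one_shot:.3f}, per step: {per_step:.3f}")
    print(f"empirical all at once: {empirical_accumulated_std(SIGMA, NUM_SAMPLES):.3f}")
```

Adding all samples at once gives a perturbation whose standard deviation is sqrt(10) times that of a single sample (about 0.316 versus 0.1), so a single correction has to absorb a much larger deviation than each of the corrections in the distributed case.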