StableDrag: Stable Dragging for Point-based Image Editing

Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, Limin Wang

TL;DR

This work tackles the unstable long-range, point-based image editing observed in DragGAN and DragDiffusion by introducing StableDrag, a framework with two key components: discriminative point tracking, which learns a lightweight convolutional filter to reliably locate updated handle points, and a confidence-based latent enhancement strategy, which keeps motion supervision complete and high-quality across all editing steps. Built atop both GAN (StableDrag-GAN) and diffusion (StableDrag-Diff) models, StableDrag demonstrates improved stability and precision on DragBench, outperforming prior methods in mean distance and image fidelity, especially for challenging or long-range manipulations. The combination of a fast, discriminative tracker and an adaptive supervision scheme enables more reliable, pixel-level edits with minimal runtime overhead, offering an approach that generalizes across generative paradigms.

Abstract

Point-based image editing has attracted remarkable attention since the emergence of DragGAN. Recently, DragDiffusion further pushed generative quality forward by adapting this dragging technique to diffusion models. Despite this great success, the dragging scheme exhibits two major drawbacks, namely inaccurate point tracking and incomplete motion supervision, which may result in unsatisfactory dragging outcomes. To tackle these issues, we build a stable and precise drag-based editing framework, coined StableDrag, by designing a discriminative point tracking method and a confidence-based latent enhancement strategy for motion supervision. The former allows us to precisely locate the updated handle points, thereby boosting the stability of long-range manipulation, while the latter is responsible for keeping the optimized latent as high-quality as possible across all manipulation steps. Thanks to these designs, we instantiate two types of image editing models, StableDrag-GAN and StableDrag-Diff, which attain more stable dragging performance, as shown through extensive qualitative experiments and quantitative assessment on DragBench.
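The abstract states only that the latent enhancement strategy is confidence-based; it does not spell out the rule. One plausible reading, offered here as an assumption rather than the authors' confirmed procedure, is that the tracker's confidence decides which feature supervises the motion step: the current handle feature when tracking is trusted, and the initial template feature otherwise. The function names, the threshold value, and the single-point L1 loss below are all hypothetical.

```python
import numpy as np

def supervision_target(current_feat, template_feat, confidence, threshold=0.5):
    """Choose the feature the shifted location is pulled toward.

    Assumption: `confidence` is a tracking score in [0, 1] and `threshold`
    is an illustrative value, not one taken from the paper.
    """
    if confidence >= threshold:
        # Tracking is trusted: supervise with the current handle feature.
        return current_feat
    # Tracking is doubtful: fall back to the initial template feature.
    return template_feat

def motion_loss(shifted_feat, target_feat):
    """L1 distance between the shifted-location feature and a constant target."""
    return float(np.abs(shifted_feat - target_feat).sum())
```

In a full pipeline the target would be treated as a constant (detached) so that only the latent code receives gradients; this sketch covers the selection rule alone.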

Paper Structure (24 sections, 5 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Comparison between DragGAN/DragDiffusion [shi2023dragdiffusion] and our proposed StableDrag. StableDrag-GAN and StableDrag-Diff are our proposed methods built upon GAN and diffusion models, respectively. Given an input image (a synthetic image from a GAN/diffusion model, or a real image), users assign handle points (red) and target points (blue) to drive the semantic positions of the handle points to the corresponding target points. The Mona Lisa portrait and the examples in the last row are real-image inputs, while the others are synthesized by StyleGAN2 or Stable Diffusion-V1.5 [rombach2022high]. The examples demonstrate that our method achieves more precise point-level manipulation and generates higher-quality edited images than DragGAN and DragDiffusion.
  • Figure 2: Illustration of our dragging scheme for a single intermediate optimization step. The pipeline illustrated here is based on a GAN; the diffusion-based one follows the same scheme. 'Discriminative PT.' denotes the discriminative point tracking module and 'Confident MS.' denotes the confident motion supervision process. $P_{i}$ is the current handle point at the $i^{th}$ optimization step. Notably, the tracking model, in the form of a convolution filter, is learned only at the first optimization step and is reused unchanged in subsequent steps. Its learning process at the first step is described in Fig. 3. The latent code $w$ is optimized via backward updates across all steps.
  • Figure 3: Learning process of our point tracking model. It is performed only before the manipulation process. The initial feature of the local patch is detached, so only the tracking model is optimized. The tracking model weight is initialized with the template feature $f_i$.
  • Figure 4: Comparison between FreeDrag [ling2023freedrag] and our StableDrag. For the example in the top left, the handle points at each optimization step are visualized to show how the optimization paths of FreeDrag and our StableDrag-GAN differ. The example in the bottom left demonstrates our method's strength in creating novel content. The remaining examples show that StableDrag generates more precise dragging outcomes.
  • Figure 5: Comparison between DragGAN [pan2023drag], DragDiffusion [shi2023dragdiffusion], FreeDrag [ling2023freedrag], and our StableDrag. As in DragGAN, users can optionally draw a mask of the flexible region (brighter area), keeping the rest of the image fixed. The green dashed boxes in the Terra Cotta Warriors sculpture and panda examples highlight differences in detail. Best viewed zoomed in.
  • ...and 4 more figures
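
The captions of Figures 2 and 3 describe the tracker as a convolution filter that is initialized with the template feature, learned once while the patch features are held fixed, and then reused at later steps. A minimal NumPy sketch of that idea follows. It rests on several assumptions not stated in the captions: a 1x1 filter, a sigmoid score with binary cross-entropy against a one-hot label, plain gradient descent, and hypothetical function names and hyperparameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def crop(feat, center, radius):
    """Return the (C, h, w) window around `center` and its top-left corner."""
    y, x = center
    y0, y1 = max(y - radius, 0), min(y + radius + 1, feat.shape[1])
    x0, x1 = max(x - radius, 0), min(x + radius + 1, feat.shape[2])
    return feat[:, y0:y1, x0:x1], (y0, x0)

def learn_tracker(feat, handle, radius=3, steps=200, lr=0.1):
    """Fit a 1x1 filter that scores the handle point high and its neighbours low.

    `feat` is a (C, H, W) feature map treated as a constant (the "detached"
    patch), so only the filter weights `w` are updated.
    """
    patch, (y0, x0) = crop(feat, handle, radius)
    w = feat[:, handle[0], handle[1]].copy()  # initialise with the template feature
    label = np.zeros(patch.shape[1:])
    label[handle[0] - y0, handle[1] - x0] = 1.0  # assumed one-hot target
    for _ in range(steps):
        score = sigmoid(np.einsum("c,chw->hw", w, patch))
        # Gradient of the mean binary cross-entropy with respect to w.
        grad = np.einsum("hw,chw->c", score - label, patch) / label.size
        w -= lr * grad
    return w

def track(feat, w, prev, radius=3):
    """Locate the handle in a new feature map by searching around `prev`.

    Returns the best-scoring position and its score, which can serve as a
    tracking confidence.
    """
    patch, (y0, x0) = crop(feat, prev, radius)
    score = sigmoid(np.einsum("c,chw->hw", w, patch))
    dy, dx = np.unravel_index(np.argmax(score), score.shape)
    return (int(y0 + dy), int(x0 + dx)), float(score[dy, dx])
```

A typical use would call `learn_tracker` once on the initial feature map, then call `track` after every latent update with the previous handle position. Because the filter is trained to reject nearby distractors instead of relying on nearest-neighbour feature matching, the returned score doubles as a confidence estimate.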