LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos

Yujun Shi; Jun Hao Liew; Hanshu Yan; Vincent Y. F. Tan; Jiashi Feng

TL;DR

LightningDrag addresses the slow speed and limited accuracy of drag-based image editing by reframing the task as conditional generation. It combines a latent diffusion backbone, an appearance encoder for identity preservation, and a point-embedding mechanism that injects the user's drag instructions into attention. Trained on large-scale video supervision, it learns realistic object motion and deformation, enabling high-quality edits in ~1s without latent optimization at inference. The approach reports higher accuracy and faster editing than prior methods on DragBench, supports test-time refinements such as noise priors and point-following classifier-free guidance (CFG), and offers practical "drag engineering" techniques such as point augmentation and sequential dragging. Although the model is built on Stable Diffusion 1.5, the authors show potential improvements via diffusion-model scaling and acceleration methods.

Abstract

Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based image editing framework that achieves pixel-level control using Generative Adversarial Networks (GANs). A flurry of subsequent studies enhanced this framework's generality by leveraging large-scale diffusion models. However, these methods often suffer from inordinately long processing times (exceeding 1 minute per edit) and low success rates. Addressing these issues head-on, we present LightningDrag, a rapid approach enabling high-quality drag-based image editing in ~1 second. Unlike most previous methods, we redefine drag-based editing as a conditional generation task, eliminating the need for time-consuming latent optimization or gradient-based guidance during inference. In addition, the design of our pipeline allows us to train our model on large-scale paired video frames, which contain rich motion information such as object translations, changing poses and orientations, zooming in and out, etc. By learning from videos, our approach can significantly outperform previous methods in terms of accuracy and consistency. Despite being trained solely on videos, our model generalizes well to local shape deformations not present in the training data (e.g., lengthening of hair, twisting rainbows, etc.). Extensive qualitative and quantitative evaluations on benchmark datasets corroborate the superiority of our approach. The code and model will be released at https://github.com/magic-research/LightningDrag.

Paper Structure (22 sections, 5 equations, 9 figures, 2 tables)

Introduction
Related Works
Preliminaries
Latent Diffusion Models
Methodology
Paired supervision from video data
Architecture Design
Inpainting Backbone
Appearance Encoder
Point Embedding Attention
Test-time Techniques to Improve Editing Results
Noise prior
Point-following classifier-free guidance
"Drag engineering" to improve the editing
Point augmentation
...and 7 more sections

Figures (9)

Figure 1: Samples of collected supervision pairs from videos. Video motion contains various transformation cues such as pose change, object movement and scale change, which are useful for the model to learn how objects change and deform while avoiding appearance change.
Figure 2: The pipeline of LightningDrag. LightningDrag consists of three components: (1) an inpainting diffusion backbone to ensure that unmasked regions remain untouched; (2) an Appearance Encoder for preserving the identity of the reference image; and (3) a Point Embedding Network to encode the (handle, target) point pairs.
Figure 3: Different strategies for constructing the noise prior. We find that the "noise source latents" strategy produces the best results. Image credit (source image): Pexels
Figure 4: Effects of different CFG scale schedules. Our model struggles to conduct a successful drag when CFG is not used. A constant CFG scale often leads to an over-saturation problem. Overall, the fast-decaying strategy (inverse square) attains the best results.
Figure 5: Point Augmentation. Augmenting with additional pairs of handle and target points can better convey the user's editing intention, which often leads to better performance.
...and 4 more figures
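
The CFG scale schedules compared in Figure 4 can be illustrated with a small sketch. The exact formulas are not given on this page, so the decay expression, the function names `cfg_scale_schedule` and `apply_cfg`, and the default values `w_max=3.0` and `w_min=1.0` below are all assumptions chosen for illustration, not the authors' implementation. The sketch only shows the general idea: a guidance scale that starts high and decays quickly toward 1 over the denoising steps, compared with a constant scale or no guidance.

```python
from typing import List, Sequence


def cfg_scale_schedule(
    num_steps: int,
    w_max: float = 3.0,   # assumed starting guidance scale (illustrative)
    w_min: float = 1.0,   # scale 1.0 is equivalent to using no guidance
    kind: str = "inverse_square",
) -> List[float]:
    """Return one guidance scale per denoising step (hypothetical helper)."""
    scales = []
    for i in range(num_steps):
        if kind == "none":
            decay = 0.0                      # always w_min: CFG disabled
        elif kind == "constant":
            decay = 1.0                      # always w_max
        elif kind == "linear":
            decay = 1.0 - i / max(num_steps - 1, 1)
        elif kind == "inverse_square":
            decay = 1.0 / (1 + i) ** 2       # assumed fast-decaying form
        else:
            raise ValueError(f"unknown schedule: {kind}")
        scales.append(w_min + (w_max - w_min) * decay)
    return scales


def apply_cfg(
    eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float
) -> List[float]:
    """Standard classifier-free guidance combination of two noise predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    for kind in ("none", "constant", "linear", "inverse_square"):
        print(kind, [round(s, 3) for s in cfg_scale_schedule(4, kind=kind)])
```

In a sampler, the scale for step `i` would be passed to `apply_cfg` together with the unconditional and point-conditioned noise predictions. With a scale of 1.0 the result reduces to the conditional prediction alone, which corresponds to the "CFG not used" case in the caption.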
