InstantDrag: Improving Interactivity in Drag-based Image Editing

Joonghyuk Shin, Daehyeon Choi, Jaesik Park

TL;DR

InstantDrag tackles the slow interactivity of drag-based image editing by introducing an optimization-free pipeline that decouples motion generation from motion-conditioned diffusion. FlowGen generates dense optical flow from sparse drag cues, and FlowDiffusion performs flow-conditioned edits without text prompts or masks, trained on real-world video data to capture realistic motion. The approach achieves near real-time edits with improved fidelity, while reducing input requirements and memory usage compared to optimization-based methods; it generalizes beyond faces to general scenes, though very large motions or unseen domains may require fine-tuning. Overall, InstantDrag advances interactive, real-time drag-based editing by delivering fast, high-quality, mask-free edits using dedicated motion-generation and diffusion components.

Abstract

Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing in real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.
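The abstract says the pipeline needs only an image and a drag instruction, and the Figure 2 caption says the sparse drag input is channel-wise concatenated with the image before it enters the generator. A minimal sketch of how such an input could be assembled is shown below. The function names, the `(x0, y0, x1, y1)` drag format, and the two-channel displacement encoding are illustrative assumptions, not details confirmed by this page.

```python
import numpy as np

def rasterize_drags(drags, height, width):
    """Write sparse drags into an (H, W, 2) displacement map (assumed encoding).

    Each drag is (x0, y0, x1, y1) in pixel coordinates: start point, end point.
    Pixels without a drag stay at zero.
    """
    sparse_flow = np.zeros((height, width, 2), dtype=np.float32)
    for x0, y0, x1, y1 in drags:
        if 0 <= x0 < width and 0 <= y0 < height:
            sparse_flow[y0, x0, 0] = x1 - x0  # horizontal displacement
            sparse_flow[y0, x0, 1] = y1 - y0  # vertical displacement
    return sparse_flow

def build_generator_input(image, drags):
    """Channel-wise concatenate an (H, W, 3) image with the sparse drag map."""
    height, width, _ = image.shape
    sparse_flow = rasterize_drags(drags, height, width)
    return np.concatenate([image, sparse_flow], axis=-1)  # (H, W, 5)

# Hypothetical example: one drag that moves the point (10, 20) eight pixels right.
image = np.zeros((64, 64, 3), dtype=np.float32)
generator_input = build_generator_input(image, [(10, 20, 18, 20)])
```

A dense-flow generator would consume `generator_input` and predict a full `(H, W, 2)` flow field. Normalization of the displacements and any extra validity channel are left out because this page does not describe them.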

Paper Structure

This paper contains 27 sections, 11 equations, 19 figures, and 2 tables.

Figures (19)

  • Figure 1: Illustration of our inference pipeline. Given sparse user drag input, our FlowGen estimates dense optical flow, and our FlowDiffusion edits the original image with the flow guidance. Our approach does not require auxiliary inputs such as text prompts or foreground masks. It is inversion- and optimization-free, providing the edited image in about a second.
  • Figure 2: Illustration of our FlowGen architecture (Sec. 3.1.1). Sparse user drag input is channel-wise concatenated with the input image and fed into the generator, which predicts dense optical flow. Based on a Pix2Pix-like GAN architecture, FlowGen is trained using the adversarial loss from the discriminator and the reconstruction loss from the generator.
  • Figure 3: Illustration of our FlowDiffusion architecture (Sec. 3.1.2). The denoising U-Net of FlowDiffusion takes encoded image and downscaled optical flow as inputs. It leverages channel-wise concatenated input image and optical flow to guide the denoising process, learning to predict subsequent video frames in the latent space based on motion information.
  • Figure 4: Dragging results from FlowGen trained under four settings: (A) Stochastic sampling strategy (Sec. 3.2.2), (B) 1 fixed point (nose), (C) 100 fixed grid points, (D) 900 fixed grid points. Excessive points (C, D) generate sparse motion while a single point (B) causes undesired movements. We find (A) to be the most robust, combining the advantages of the other approaches.
  • Figure 5: Visualization of the mask operation in Sec. 3.2.3. $I^{new}$ combines the object from $I_{1}$ and the background from $I_{2}$. Blue contour shows $I_2$'s mask and green contour shows its dilated mask.
  • ...and 14 more figures
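
The Figure 5 caption describes a composite $I^{new}$ that takes the object from $I_{1}$ and the background from $I_{2}$, and it shows a dilated object mask. A generic way to build such a composite is $I^{new} = M \odot I_{1} + (1 - M) \odot I_{2}$ with a binary mask $M$. The sketch below implements that formula with a simple square dilation. Which mask the paper actually uses, how far it is dilated, and the helper names `dilate` and `composite` are assumptions made for illustration only.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    height, width = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.zeros((height, width), dtype=bool)
    # OR together every shifted copy of the mask inside the square window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + height, dx:dx + width]
    return out

def composite(i1, i2, mask):
    """Take pixels from i1 where mask is True and from i2 elsewhere."""
    m = mask[..., None].astype(i1.dtype)  # (H, W, 1), broadcast over channels
    return m * i1 + (1.0 - m) * i2

# Hypothetical toy frames: i1 is all ones (object frame), i2 is all zeros (background frame).
i1 = np.ones((9, 9, 3), dtype=np.float32)
i2 = np.zeros((9, 9, 3), dtype=np.float32)
mask = np.zeros((9, 9), dtype=bool)
mask[4, 4] = True                      # single-pixel "object"
dilated = dilate(mask, radius=1)       # grows to a 3 x 3 block
i_new = composite(i1, i2, dilated)
```

Dilating before compositing gives the object region a small safety margin, so pixels just outside a tight segmentation boundary still come from the object frame.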