InstantDrag: Improving Interactivity in Drag-based Image Editing
Joonghyuk Shin, Daehyeon Choi, Jaesik Park
TL;DR
InstantDrag addresses the slow interactivity of drag-based image editing with an optimization-free pipeline that decouples motion generation from motion-conditioned image generation. FlowGen generates dense optical flow from sparse drag cues, and FlowDiffusion applies flow-conditioned edits without text prompts or masks; the pipeline is trained on real-world video data to capture realistic motion. The approach achieves near real-time edits with improved fidelity while requiring fewer inputs and less memory than optimization-based methods. It generalizes beyond faces to general scenes, though very large motions or unseen domains may require fine-tuning.
Abstract
Drag-based image editing has recently gained popularity for its interactivity and precision. However, despite the ability of text-to-image models to generate samples within a second, drag editing still lags behind due to the challenge of accurately reflecting user interaction while maintaining image content. Some existing approaches rely on computationally intensive per-image optimization or intricate guidance-based methods, requiring additional inputs such as masks for movable regions and text prompts, thereby compromising the interactivity of the editing process. We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed, requiring only an image and a drag instruction as input. InstantDrag consists of two carefully designed networks: a drag-conditioned optical flow generator (FlowGen) and an optical flow-conditioned diffusion model (FlowDiffusion). InstantDrag learns motion dynamics for drag-based image editing from real-world video datasets by decomposing the task into motion generation and motion-conditioned image generation. We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts through experiments on facial video datasets and general scenes. These results highlight the efficiency of our approach in handling drag-based image editing, making it a promising solution for interactive, real-time applications.
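The two-stage decomposition described above can be sketched as a small data-flow example. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `drags_to_sparse_flow` and `edit_with_drags`, the encoding of drags as a sparse flow map plus a validity mask, and the call signatures of the two stages are all hypothetical. The FlowGen and FlowDiffusion networks themselves are represented only as callables passed in by the caller.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical encodings, chosen for illustration only.
Drag = Tuple[Tuple[int, int], Tuple[int, int]]  # ((x_src, y_src), (x_dst, y_dst))
Flow = List[List[Tuple[float, float]]]          # flow[y][x] = (dx, dy)
Mask = List[List[int]]                          # 1 where a drag starts, else 0


def drags_to_sparse_flow(drags: List[Drag], height: int, width: int) -> Tuple[Flow, Mask]:
    """Rasterize drag instructions into a sparse flow map and a validity mask."""
    flow: Flow = [[(0.0, 0.0) for _ in range(width)] for _ in range(height)]
    mask: Mask = [[0 for _ in range(width)] for _ in range(height)]
    for (xs, ys), (xd, yd) in drags:
        # Ignore drags whose start point lies outside the image.
        if 0 <= xs < width and 0 <= ys < height:
            flow[ys][xs] = (float(xd - xs), float(yd - ys))
            mask[ys][xs] = 1
    return flow, mask


def edit_with_drags(
    image: List[List[Any]],
    drags: List[Drag],
    flow_generator: Callable[[Any, Flow, Mask], Flow],
    flow_conditioned_editor: Callable[[Any, Flow], Any],
) -> Any:
    """Run the two stages in sequence; no mask or text prompt is requested."""
    height, width = len(image), len(image[0])
    sparse_flow, mask = drags_to_sparse_flow(drags, height, width)
    # Stage 1 (motion generation): sparse drag cues -> dense optical flow.
    dense_flow = flow_generator(image, sparse_flow, mask)
    # Stage 2 (motion-conditioned image generation): image + dense flow -> edit.
    return flow_conditioned_editor(image, dense_flow)
```

In this sketch the only user-supplied inputs are the image and the drag list, mirroring the input requirements stated in the abstract. Any pair of models could be plugged in as the two callables, provided they accept the assumed signatures.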
