LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum

TL;DR

LazyDrag addresses drag-based editing instability by replacing fragile implicit attention point matching with an explicit correspondence map derived from user drags. Built on Multi-Modal Diffusion Transformers, it enables stable full-strength inversion without test-time optimization and unifies precise geometric control with text guidance through a correspondence-driven attention scheme. It preserves identity and background, supports natural inpainting, and resolves ambiguous edits via text prompts, enabling complex edits previously unattainable. On DragBench, LazyDrag achieves state-of-the-art drag accuracy and perceptual quality, demonstrating a practical, TTO-free editing paradigm for diffusion-based generation.

Abstract

Reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a compromise: weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from user drag inputs and uses it as a reliable reference to strengthen attention control. This reference makes a stable full-strength inversion process possible, a first for drag-based editing. It removes the need for TTO and unlocks the generative capability of the underlying model. As a result, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes such as moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance but also opens a path to new editing paradigms.
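
To make the idea of an explicit correspondence map concrete, here is a minimal toy sketch in NumPy. It is not the paper's algorithm: the function name `correspondence_map`, the `radius` parameter, and the linear-falloff weights are assumptions introduced purely for illustration. Each latent-grid cell near a drag target is assigned the source cell found by subtracting the drag displacement, plus a weight that decays with distance from the target.

```python
import numpy as np

def correspondence_map(height, width, drags, radius):
    """Toy explicit correspondence map (illustrative assumption, not the paper's method).

    drags: list of ((sx, sy), (tx, ty)) handle -> target pairs in grid coordinates.
    Returns:
      src    -- (height, width, 2) array; src[y, x] = (source_y, source_x), or (-1, -1)
                where no drag influences the cell.
      weight -- (height, width) array in [0, 1]; 1 at a drag target, 0 outside `radius`.
    """
    src = np.full((height, width, 2), -1, dtype=np.int64)
    weight = np.zeros((height, width), dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]

    for (sx, sy), (tx, ty) in drags:
        # Linear falloff with distance from the drag target (hypothetical choice).
        dist = np.hypot(xs - tx, ys - ty)
        w = np.clip(1.0 - dist / radius, 0.0, 1.0)

        # A destination cell corresponds to the cell one displacement "behind" it.
        cand_x = xs - (tx - sx)
        cand_y = ys - (ty - sy)
        inside = (cand_x >= 0) & (cand_x < width) & (cand_y >= 0) & (cand_y < height)

        # Where several drags overlap, keep the one with the larger weight.
        better = (w > weight) & inside
        src[better, 0] = cand_y[better]
        src[better, 1] = cand_x[better]
        weight[better] = w[better]

    return src, weight

# One drag moving the handle at (x=2, y=2) three cells to the right.
src, weight = correspondence_map(8, 8, [((2, 2), (5, 2))], radius=2.0)
print(src[2, 5], weight[2, 5])
```

Because the map is computed directly from the drag coordinates, it does not depend on attention similarity between handle and target features, which is the property the abstract contrasts with implicit point matching.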

Paper Structure

This paper contains 50 sections, 12 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: (a) Top: Comparison between our method and two baselines. The leftmost image shows the input image with multiple drag instructions, each indicated by a different color. The text below each result indicates the additional prompt used for generation. "N/A" means no additional prompt. TTO denotes test-time optimization, where the method requires fine-tuning per image and multi-step latent optimization per drag instruction. Notably, our method successfully opens the mouth of the dog and inpaints its interior. Furthermore, with prompt guidance, we can generate diverse results even under ambiguous drag inputs without fine-tuning. (b) Bottom: Multi-round editing results using our approach. Our method supports not only sequential drag operations but also simultaneous actions like movement and scaling, maintaining visual coherence throughout.
  • Figure 2: Effect of inversion strength. Examples of LazyDrag under different inversion strengths. The additional prompt is "a red apple in the mouth".
  • Figure 2: User study on DragBench.
  • Figure 3: Pipeline of LazyDrag. (a) An input image is inverted to a latent code $\boldsymbol{z}_T$. Our correspondence map generation then yields an updated latent $\hat{\boldsymbol{z}}_T$, a point matching map, and weights $\alpha$. Tokens cached during inversion are used to guide the sampling process for identity and background preservation. (b) In attention input control, a dual strategy is employed. For background regions (gray), the $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ tokens are replaced with their cached originals. For destination regions (red and blue) and transition regions (yellow), the $\mathbf{K}$ and $\mathbf{V}$ tokens are concatenated with re-encoded ($\mathbf{K}$ only) source tokens retrieved via the map. (c) Attention output refinement performs value blending of the attention output. $\otimes$ and $\oplus$ denote element-wise product and addition.
  • Figure 4: Qualitative results compared with baselines on DragBench. Best viewed with zoom-in.
  • ...and 10 more figures
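
The attention control described in the Figure 3 caption can be pictured with a small single-head NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function `drag_attention`, its argument names, and the choice to blend the attention output with the cached source value are hypothetical, and the positional re-encoding of source keys mentioned in the caption is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def drag_attention(q, k, v, q_c, k_c, v_c, background, src_index, alpha):
    """Toy correspondence-driven attention (illustrative assumption).

    q, k, v       -- (N, d) tokens of the current sampling step
    q_c, k_c, v_c -- (N, d) tokens cached during inversion
    background    -- (N,) bool mask; True for background tokens
    src_index     -- (N,) int; matched source token per token, -1 if none
    alpha         -- (N,) blend weights in [0, 1]
    """
    # Input control, background: swap in the cached originals.
    bg = background[:, None]
    q = np.where(bg, q_c, q)
    k = np.where(bg, k_c, k)
    v = np.where(bg, v_c, v)

    out = np.empty_like(v)
    for i in range(q.shape[0]):
        j = src_index[i]
        if j < 0:
            out[i] = attention(q[i:i + 1], k, v)[0]
            continue
        # Input control, destination/transition: append the matched source token.
        k_ext = np.concatenate([k, k_c[j:j + 1]], axis=0)
        v_ext = np.concatenate([v, v_c[j:j + 1]], axis=0)
        attended = attention(q[i:i + 1], k_ext, v_ext)[0]
        # Output refinement: element-wise blend (hypothetical blending target).
        out[i] = (1.0 - alpha[i]) * attended + alpha[i] * v_c[j]
    return out

rng = np.random.default_rng(0)
q, k, v, q_c, k_c, v_c = (rng.normal(size=(4, 3)) for _ in range(6))
background = np.array([True, False, False, False])
src_index = np.array([-1, -1, 0, -1])   # token 2 is matched to cached source token 0
alpha = np.array([0.0, 0.0, 0.7, 0.0])
print(drag_attention(q, k, v, q_c, k_c, v_c, background, src_index, alpha).shape)
```

In this sketch a token with no correspondence reduces to ordinary attention, and a token with `alpha` equal to 1 copies its cached source value outright, so `alpha` acts as a dial between free generation and strict preservation.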