DragNeXt: Rethinking Drag-Based Image Editing

Yuan Zhou, Junbao Zhou, Qingshan Xu, Kesen Zhao, Yuxuan Wang, Hao Fei, Richang Hong, Hanwang Zhang

TL;DR

DragNeXt reframes drag-based image editing as Latent Region Optimization (LRO) over region-level transforms, addressing the core ambiguity of how and what to drag. It replaces brittle point-based motion supervision with Progressive Backward Self-Intervention (PBSI), which leverages intermediate drag states and diffusion-model priors to guide latent updates. The work introduces NextBench, a dedicated benchmark with explicit user-intention annotations, and demonstrates that DragNeXt achieves a superior efficiency–quality trade-off, outperforming existing methods on region-level metrics and user preferences. Together, these advances offer a more reliable, scalable framework for fine-grained, region-guided image editing using diffusion models.

Abstract

Drag-Based Image Editing (DBIE), which allows users to manipulate images by directly dragging objects within them, has recently attracted much attention from the community. However, it faces two key challenges: (i) point-based drag is often highly ambiguous and difficult to align with users' intentions; (ii) current DBIE methods primarily rely on alternating between motion supervision and point tracking, which is not only cumbersome but also fails to produce high-quality results. These limitations motivate us to explore DBIE from a new perspective: redefining it as deformation, rotation, and translation of user-specified handle regions. By requiring users to explicitly specify both drag areas and drag types, we can effectively address the ambiguity issue. Furthermore, we propose a simple yet effective editing framework, dubbed DragNeXt. It unifies DBIE as a Latent Region Optimization (LRO) problem and solves it through Progressive Backward Self-Intervention (PBSI), simplifying the overall procedure of DBIE while further enhancing quality by fully leveraging region-level structure information and progressive guidance from intermediate drag states. We validate DragNeXt on our NextBench, and extensive experiments demonstrate that our proposed method significantly outperforms existing approaches. Code will be released on GitHub.
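
To make the region-level formulation concrete, the following is a minimal sketch of what an explicit drag instruction (a handle region plus a drag type) could look like, and how a target region might be derived from it for translation and rotation. It is not the authors' implementation: the names `RegionDrag` and `target_region`, the point-list representation of a region, and the rigid-transform formulas are all assumptions made for illustration, and deformation is left out.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class RegionDrag:
    """Hypothetical region-level drag instruction: which area, and how to move it."""
    region: List[Point]          # coordinates belonging to the user-specified handle region
    drag_type: str               # "translation" or "rotation" (deformation omitted here)
    offset: Point = (0.0, 0.0)   # displacement, used by translation only
    angle_deg: float = 0.0       # rotation angle, used by rotation only
    pivot: Point = (0.0, 0.0)    # rotation centre, used by rotation only


def target_region(drag: RegionDrag) -> List[Point]:
    """Map every handle-region point to its target position (illustrative only)."""
    if drag.drag_type == "translation":
        dx, dy = drag.offset
        return [(x + dx, y + dy) for x, y in drag.region]
    if drag.drag_type == "rotation":
        # Counter-clockwise in a y-up frame; this appears clockwise in image
        # coordinates, where y points down.
        theta = math.radians(drag.angle_deg)
        c, s = math.cos(theta), math.sin(theta)
        px, py = drag.pivot
        return [
            (px + c * (x - px) - s * (y - py), py + s * (x - px) + c * (y - py))
            for x, y in drag.region
        ]
    raise ValueError(f"unsupported drag type: {drag.drag_type!r}")


if __name__ == "__main__":
    shift = RegionDrag(region=[(0.0, 0.0), (1.0, 0.0)],
                       drag_type="translation", offset=(2.0, 3.0))
    print(target_region(shift))
    turn = RegionDrag(region=[(1.0, 0.0)], drag_type="rotation", angle_deg=90.0)
    print(target_region(turn))
```

Because the drag type is part of the instruction, the same handle region yields different target regions for a translation and for a rotation, which is the ambiguity that a bare point pair cannot resolve.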

Paper Structure

This paper contains 24 sections, 2 theorems, 20 equations, 24 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

The ambiguity of DBIE is twofold: $\mapsto$ Factor-1. drag operations inherently involve multiple types (such as translation, deformation, and rotation), and treating them as type-agnostic induces ambiguity about users' intentions (how to drag?); $\mapsto$ Factor-2. point indicators are insufficient f

Figures (24)

  • Figure 1: Examples of the key issues in current DBIE. (i) Text prompts used in CLIPDrag remain insufficient for resolving the ambiguity issue; (ii) predefined mapping functions employed by FastDrag and RegionDrag boost efficiency but severely compromise editing quality. The numbers in the upper-left corner of the images indicate the latency for dragging the regions of handle points to target positions.
  • Figure 2: Factor-1 and -2.
  • Figure 3: Rethink DBIE.
  • Figure 4: Examples of estimating target regions.
  • Figure 5: A brief illustration of our DragNeXt.
  • ...and 19 more figures

Theorems & Definitions (4)

  • Proposition 1: Key Factors to Ambiguity
  • Proposition 2: Rethink DBIE
  • Definition 1: Reliable DBIE
  • Definition 2: Unify DBIE as LRO