AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing

DuoSheng Chen, Binghui Chen, Yifeng Geng, Liefeng Bo

TL;DR

AdaptiveDrag tackles the limitations of prior drag-based image editing methods by delivering a mask-free, semantics-aware editing framework. It combines an auto mask generation step (SAM-2 plus SLIC) with a semantic-driven latent optimization and a Correspondence Loss (CLoss) to stabilize diffusion sampling, enabling accurate dragging from handle points to target points across diverse domains. The approach demonstrates superior precision and feature preservation on tasks such as resizing, movement, and extension in scenes ranging from animals and faces to landscapes and clothing, with strong generalization to new domains. These contributions advance interactive diffusion-based editing by reducing user burden and aligning edits with meaningful semantic regions, offering practical gains for fine-grained image manipulation.

Abstract

Recently, several point-based image editing methods (e.g., DragDiffusion, FreeDrag, DragNoise) have emerged, yielding precise and high-quality results based on user instructions. However, these methods often make insufficient use of semantic information, leading to less desirable results. In this paper, we propose a novel mask-free point-based image editing method, AdaptiveDrag, which provides a more flexible editing approach and generates images that better align with user intent. Specifically, we design an auto mask generation module that uses super-pixel division, so the user does not need to draw a mask. Next, we leverage a pre-trained diffusion model to optimize the latent, enabling the dragging of features from handle points to target points. To ensure a comprehensive connection between the input image and the drag process, we develop a semantic-driven optimization: we design adaptive steps that are supervised by the positions of the points and by the semantic regions derived from super-pixel segmentation. This refined optimization process also leads to more realistic and accurate drag results. Furthermore, to address the limitations in the generative consistency of the diffusion model, we introduce a correspondence loss during the sampling process. Building on these designs, our method delivers superior generation results using only a single input image and the handle-target point pairs. Extensive experiments demonstrate that the proposed method outperforms others in handling various drag instructions (e.g., resize, movement, extension) across different domains (e.g., animals, human faces, landscapes, clothing).
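To make the mask-free idea in the abstract concrete, the Python sketch below shows one way a super-pixel label map could drive both an automatic mask and a semantic patch around a point. It is a minimal sketch under stated assumptions, not the authors' implementation: the label map is taken as given, the names `labels_on_segment`, `auto_mask`, and `semantic_patch` are hypothetical, and the rule "mask = union of super-pixels crossed by the drag path" is an assumption made only for illustration.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]  # (row, col); hypothetical convention for this sketch


def labels_on_segment(labels: List[List[int]], start: Point, end: Point) -> Set[int]:
    """Collect the super-pixel labels touched by the straight line start -> end."""
    (r0, c0), (r1, c1) = start, end
    steps = max(abs(r1 - r0), abs(c1 - c0), 1)
    touched = set()
    for s in range(steps + 1):
        r = round(r0 + (r1 - r0) * s / steps)
        c = round(c0 + (c1 - c0) * s / steps)
        touched.add(labels[r][c])
    return touched


def auto_mask(labels: List[List[int]], handle: Point, target: Point) -> List[List[bool]]:
    """Assumed rule: editable mask = union of super-pixels crossed by the drag path."""
    keep = labels_on_segment(labels, handle, target)
    return [[lab in keep for lab in row] for row in labels]


def semantic_patch(labels: List[List[int]], center: Point, radius: int) -> List[Point]:
    """Pixels of the square patch around `center` that share its super-pixel label."""
    h, w = len(labels), len(labels[0])
    r0, c0 = center
    ref = labels[r0][c0]
    return [
        (r, c)
        for r in range(max(0, r0 - radius), min(h, r0 + radius + 1))
        for c in range(max(0, c0 - radius), min(w, c0 + radius + 1))
        if labels[r][c] == ref
    ]


# Toy 4x4 label map with four super-pixels (labels 0..3).
toy_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
toy_mask = auto_mask(toy_labels, handle=(3, 0), target=(0, 0))
toy_patch = semantic_patch(toy_labels, center=(1, 1), radius=1)
```

In this toy example, dragging up the left column selects super-pixels 0 and 2 as the editable region, and the patch around `(1, 1)` keeps only the pixels that share its label instead of the full 3x3 square. A fixed square patch would also include pixels from neighbouring regions.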

Paper Structure

This paper contains 31 sections, 6 equations, 20 figures, 3 tables, 1 algorithm.

Figures (20)

  • Figure 1: Existing methods face two main issues: (a) 'Drag missing' (left): EasyDrag fails to guide the succulent to the target points because the point search is ineffective for long-distance drag instructions. (b) 'Feature maintenance failure' (right): DragDiffusion fails to preserve the features in the middle part of the mountain when the peak is dragged to a higher position.
  • Figure 2: The overall framework of AdaptiveDrag comprises four key steps: diffusion model inversion, auto mask generation, semantic-driven optimization, and correspondence sampling. First, the model obtains the noised feature $z_t$ through inversion and generates the mask using the auto mask generation module. Second, the semantic-driven optimization updates $z_t$ based on the handle point $p_i^0$ and the target point $t_i$ specified in the user's instructions. Third, we perform the sampling operation to denoise $z'_t$ using reference-latent-control ($K, V$) and the correspondence feature alignment loss ($CLoss$) on $z'_t$. Finally, we obtain the drag result from $z'_0$, as predicted by DDIM sampling.
  • Figure 3: Results of different segmentation schemes. (a) The SAM 2 (Ravi et al., 2024) segmentation result for the landscape view, effectively separating the overall mountain from its surroundings. (b) The super-pixel patches generated by the SLIC algorithm in the RGB space of the input image appear chaotic. (c) Applying SLIC in the feature space of SAM 2 reveals a clearer and more finely divided representation of the mountainous region. (d) The auto mask generated when the user drags upward from the peak area. (e) / (f) The drag results of DragDiffusion and ours show that the proposed approach achieves more precise positioning while preserving the original features of the mountain, effectively avoiding the mixing of the two peaks.
  • Figure 4: Illustration of our position-supervised backtracking pipeline. $p_i^0$, $h_i^k$, and $t_i$ denote the handle point, the current search point at the $k$-th update, and the target point, respectively. The left side illustrates the standard optimization process, while the right side presents our backtracking design, which incorporates both the moving direction and the moving distance into the constraints of point optimization.
  • Figure 5: Illustration of the semantic-driven feature optimization, where the red, yellow, and blue points represent the handle, predicted, and target points, respectively. (a) The input image with user instructions. (b) The point tracking process utilizes a fixed square patch (red box) that includes additional grass features (indicated by the pink arrow). (d) The semantic region design provides a more precise mask for the patch, as illustrated in the red and yellow boxes. (c) / (e) Visual comparison: DragDiffusion employs a fixed square region with side length $r$, where the grass features are mixed with the stone. In contrast, our approach produces a clearer dragging result based on the semantic region.
  • ...and 15 more figures
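
The Figure 4 caption says the backtracking design adds the moving direction and the moving distance as constraints on point optimization. The Python sketch below illustrates one possible reading of that idea; it is an assumption for illustration and not the paper's algorithm. The function name `constrained_update`, the `max_step` parameter, and the fallback rule (a clipped step along the straight line to the target) are all hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, y); hypothetical convention for this sketch


def constrained_update(current: Point, candidate: Point, target: Point,
                       max_step: float = 2.0) -> Point:
    """Accept `candidate` only if it heads toward `target` within `max_step`;
    otherwise fall back to a step along the straight line to the target."""
    to_target = (target[0] - current[0], target[1] - current[1])
    dist = math.hypot(*to_target)
    if dist == 0.0:
        return current  # already at the target, nothing to update
    move = (candidate[0] - current[0], candidate[1] - current[1])
    step = math.hypot(*move)
    # Direction constraint: the move must have a positive projection
    # onto the current-to-target direction.
    forward = (move[0] * to_target[0] + move[1] * to_target[1]) > 0.0
    # Distance constraint: the move must be non-zero and at most max_step long.
    if forward and 0.0 < step <= max_step:
        return candidate
    # Fallback (assumed): step toward the target, clipped so it never overshoots.
    scale = min(max_step, dist) / dist
    return (current[0] + to_target[0] * scale, current[1] + to_target[1] * scale)


accepted = constrained_update((0.0, 0.0), (1.0, 1.0), (10.0, 0.0))
rejected_backward = constrained_update((0.0, 0.0), (-1.0, 0.0), (10.0, 0.0))
rejected_too_far = constrained_update((0.0, 0.0), (5.0, 0.0), (10.0, 0.0))
```

Here a candidate that moves roughly toward the target with a short step is kept, while a candidate that moves backward or jumps too far is replaced by a bounded step toward the target. This way the search point cannot drift away from the drag direction between updates.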