CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing

Ziqi Jiang, Zhen Wang, Long Chen

TL;DR

CLIPDrag tackles the dual challenge of imprecise global edits and ambiguous local edits by integrating text-based guidance with drag-based control in diffusion-model editing. It introduces Global-Local Motion Supervision (GLMS) to fuse a global CLIP-derived direction with local drag signals, and Fast Point Tracking to accelerate convergence of handle points toward their targets. An identity-preserving LoRA finetuning stage helps maintain fidelity to the input image, while GLMS combines the two gradients to either drive the edit or preserve structure. Empirical results on DRAGBENCH demonstrate superior precision and reduced ambiguity over baselines, with faster convergence and robust performance across multiple edit scenarios.

Abstract

Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we choose the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. We then introduce a novel global-local motion supervision method that integrates text signals into existing drag-based methods by adapting a pre-trained vision-language model such as CLIP. Furthermore, we address the problem of slow convergence in CLIPDrag by presenting a fast point-tracking method that forces drag points to move in the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing drag-only and text-only methods.

Paper Structure (18 sections, 7 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Three different kinds of image editing methods. (a): Drag-based Editing. Users need to click handle points (red) and target points (blue). (b): Text-based Editing. Only the edit prompt is needed to perform the edit. (c): Text-Drag Editing. Both drag points and edit prompts are required.
  • Figure 2: Illustration of our scheme for a single intermediate optimization step. $z^t$ denotes the optimized latent code at the $t^{th}$ update. The local gradient and the global gradient are calculated by backpropagating the motion supervision loss and the CLIP guidance loss, respectively, with respect to the latent code. The Global-Local Gradient Fusion method then combines these two gradients to update the latent code. The new handles are inferred through our fast point tracking method.
  • Figure 3: (a) $G_g$ is consistent with $G_l$. (b) $G_g$ contradicts $G_l$. (c) Fast Point Tracking.
  • Figure 4: Comparisons with drag-based methods (DragDiff, FreeDrag) and text-based methods (DiffCLIP). $P_e$ denotes the edit prompt, which is required by CLIPDrag and DiffCLIP.
  • Figure 5: Some examples of CLIPDrag. For each input, both an edit prompt and drag points are required. $P_e$ and $P_o$ represent the edit prompts and original prompts respectively.
  • ...and 9 more figures
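
The captions of Figures 2 and 3 describe one optimization step: a local gradient $G_l$ and a global gradient $G_g$ are fused, then the handle point is re-located by fast point tracking. The sketch below illustrates that flow with NumPy. It is not the paper's formulation: the projection-based fusion rule, the L1 feature distance, and the names `fuse_gradients`, `track_handle`, and `radius` are all assumptions made for illustration.

```python
import numpy as np


def fuse_gradients(g_local, g_global, eps=1e-12):
    """Fuse a local (drag) gradient with a global (CLIP) gradient.

    Assumed rule, not taken from the paper:
    - consistent case (non-negative dot product): keep g_local and add the
      part of g_global that is orthogonal to it;
    - contradicting case (negative dot product): remove from g_local the
      component that opposes g_global.
    """
    gl = g_local.ravel()
    gg = g_global.ravel()
    dot = float(gl @ gg)
    if dot >= 0.0:
        gg_orth = gg - dot / (float(gl @ gl) + eps) * gl
        fused = gl + gg_orth
    else:
        fused = gl - dot / (float(gg @ gg) + eps) * gg
    return fused.reshape(g_local.shape)


def track_handle(feat, ref_feat, handle, target, radius=3):
    """Nearest-neighbour handle search restricted to pixels closer to the target.

    feat:     (H, W, C) feature map of the current latent (hypothetical input).
    ref_feat: (C,) feature of the original handle point.
    Candidates that are not strictly closer to the target than the current
    handle are skipped, so the handle cannot drift backwards.
    """
    h, w, _ = feat.shape
    hy, hx = handle
    ty, tx = target
    cur_dist = np.hypot(hy - ty, hx - tx)
    best, best_cost = handle, np.inf
    for y in range(max(0, hy - radius), min(h, hy + radius + 1)):
        for x in range(max(0, hx - radius), min(w, hx + radius + 1)):
            if np.hypot(y - ty, x - tx) >= cur_dist:
                continue  # not closer to the target: ignore this candidate
            cost = np.abs(feat[y, x] - ref_feat).sum()  # L1 feature distance
            if cost < best_cost:
                best, best_cost = (y, x), cost
    return best
```

In the contradicting case the fused gradient is orthogonal to `g_global`, so under this assumed rule the drag update neither follows nor opposes the text guidance. The distance filter in `track_handle` is what distinguishes it from a plain nearest-neighbour search: an exact feature match located behind the handle is never selected.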