DragText: Rethinking Text Embedding in Point-based Image Editing

Gayoon Choi, Taejin Jeong, Sujung Hong, Seong Jae Hwang

TL;DR

DragText addresses the problem that static text embeddings hinder point-based diffusion editing by causing drag halting and semantic drift. It proposes a joint optimization framework that updates the text embedding in parallel with image dragging and includes a regularization term to preserve the original prompt, so it can be plugged into existing diffusion-based drag methods. The approach yields consistent improvements in dragging accuracy and content preservation across methods, validated by qualitative results and by metrics such as MD and the product LPIPS×MD, while also enabling controllable manipulation of both image and text embeddings. The work highlights the critical role of text-image coupling in interactive editing and suggests broader implications for text-conditioned diffusion systems and prompt-aware editing pipelines.
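
As a rough illustration of how the two reported quantities combine, the sketch below assumes that MD denotes the mean Euclidean distance between where the handle content ends up and the target points, and that an LPIPS value for the edited image has already been computed elsewhere. The function names `mean_distance` and `combined_score` are hypothetical and are not taken from the paper.

```python
import math

def mean_distance(final_points, target_points):
    """Average Euclidean distance between final handle positions and targets.

    Assumption: this is what MD refers to; the summary does not define it.
    """
    dists = [math.dist(p, q) for p, q in zip(final_points, target_points)]
    return sum(dists) / len(dists)

def combined_score(lpips_value, md_value):
    """Product LPIPS x MD; lpips_value is supplied by the caller."""
    return lpips_value * md_value

# Two handle points: one landed 5 px from its target, one landed exactly on it.
final_pts = [(0.0, 0.0), (3.0, 4.0)]
target_pts = [(3.0, 4.0), (3.0, 4.0)]
md = mean_distance(final_pts, target_pts)
score = combined_score(0.1, md)  # 0.1 is a made-up LPIPS value for illustration
```

Under this reading, both factors are distances, so a smaller product would indicate an edit that moves content closer to the target while changing the rest of the image less.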

Abstract

Point-based image editing enables accurate and flexible control through content dragging. However, the role of text embedding during the editing process has not been thoroughly investigated. A significant aspect that remains unexplored is the interaction between text and image embeddings. During progressive editing in a diffusion model, the text embedding remains constant. As the image embedding increasingly diverges from its initial state, the discrepancy between the image and text embeddings presents a significant challenge. In this study, we found that the text prompt significantly influences the dragging process, particularly in maintaining content integrity and achieving the desired manipulation. Building on these insights, we propose DragText, which optimizes the text embedding in conjunction with the dragging process so that it stays paired with the modified image embedding. Simultaneously, we regularize the text optimization process to preserve the integrity of the original text prompt. Our approach can be seamlessly integrated with existing diffusion-based drag methods, enhancing performance with only a few lines of code.
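
The idea of updating the text embedding alongside the image latent, while pulling it back toward the original prompt, can be sketched on a toy problem. Everything below is an assumption made for illustration: the "drag loss" is a simple quadratic in which the feature `z + c` should match a target, the regularizer is a squared L2 penalty weighted by a hypothetical `lam`, and plain Python lists stand in for latents and embeddings. It is not the paper's loss or code.

```python
def drag_loss(z, c, target):
    # Toy stand-in for a motion supervision loss: feature z + c should match target.
    return sum((zi + ci - ti) ** 2 for zi, ci, ti in zip(z, c, target))

def drift(c, c0):
    # Squared distance of the optimized text embedding from the original one.
    return sum((ci - c0i) ** 2 for ci, c0i in zip(c, c0))

def joint_optimize(z0, c0, target, steps=50, lr=0.1, lam=1.0):
    """Alternate a latent step and a regularized text-embedding step."""
    z, c = list(z0), list(c0)
    for _ in range(steps):
        # Image-latent step: gradient of the toy drag loss with respect to z.
        resid = [zi + ci - ti for zi, ci, ti in zip(z, c, target)]
        z = [zi - lr * 2 * r for zi, r in zip(z, resid)]
        # Text step: drag-loss gradient plus the pull back toward c0.
        resid = [zi + ci - ti for zi, ci, ti in zip(z, c, target)]
        c = [ci - lr * (2 * r + 2 * lam * (ci - c0i))
             for ci, r, c0i in zip(c, resid, c0)]
    return z, c

z0, c0, target = [0.0, 0.0], [1.0, -1.0], [3.0, 2.0]
z_reg, c_reg = joint_optimize(z0, c0, target, lam=1.0)
z_free, c_free = joint_optimize(z0, c0, target, lam=0.0)
```

In this toy setting, both runs reduce the drag loss, but the run with `lam=0.0` lets the text embedding wander permanently away from `c0`, whereas the regularized run keeps it close. That mirrors, in a very simplified form, the trade-off the abstract describes between following the edit and preserving the original prompt.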

Paper Structure

This paper contains 38 sections, 12 equations, 18 figures, 3 tables, 1 algorithm.

Figures (18)

  • Figure 1: In point-based image editing, a user first draws a mask on the image to define an editable region, then edits the image by "dragging" the contents from the user-defined handle points (red) to target points (blue). Our DragText stabilizes the process of point-based image editing through parallel image and text optimization, consistently improving an array of existing diffusion-based drag methods.
  • Figure 2: Illustration of the drag editing process within the image and text embedding spaces of the diffusion model (DM). During editing, the original image embedding $\mathbf{z}_t$ naturally deviates to the dragged image latent vector $\mathbf{\bar{z}}_t$. Without text optimization, the corresponding text embedding $\mathbf{c}$ is decoupled from $\mathbf{\bar{z}}_t$, resulting in drag halting. Hence, an optimal text embedding $\mathbf{\hat{c}}$ coupled with the dragged image must be obtained to produce the optimal latent vector $\mathbf{\hat{z}}_t$, which then retains the related semantics through the text.
  • Figure 3: (a) Red boxes mark cases where the same text prompt is used consistently across diffusion processes, resulting in high-fidelity sampling. The other images result from inconsistent text prompt usage, which leads to inaccurate sampling. (b) Neither the original text prompt nor its alternative intention text prompt, crafted through prompt engineering, is sufficient to prevent drag halting.
  • Figure 4: The pipeline of DragText. The image $\mathbf{x}_0$ is encoded by a VAE encoder into the latent vector $\mathbf{z}$, and the text is encoded by a CLIP text encoder into the text embedding $\mathbf{c}$. Through DDIM inversion with $\mathbf{c}$, the latent vector $\mathbf{z}_t$ is obtained. At time step $t=35$, $\mathbf{z}^0_t$ and $\mathbf{c}$ are optimized to $\mathbf{\hat{z}}^k_t$ and $\mathbf{\hat{c}}$ by iterating motion supervision (M.S.), text optimization, and point tracking (P.T.) $k$ times.
  • Figure 5: Qualitative results of DragText. Comparing editing results with and without DragText applied to the baseline model, DragDiffusion, shows that integrating DragText improves semantic control and dragging precision.
  • ...and 13 more figures
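
The Figure 4 caption names point tracking (P.T.) as one step of each iteration. A minimal sketch of what such a step might look like follows, assuming it is a nearest-neighbor search: after the latent is updated, the handle point is moved to the nearby location whose feature best matches the handle's original feature. The function name `track_point`, the L1 distance, the square search window, and the nested-list feature map are all assumptions made for illustration, not details given in this summary.

```python
def track_point(feature_map, ref_feature, handle, radius=1):
    """Return the (row, col) near `handle` whose feature is closest to `ref_feature`.

    feature_map: nested list indexed as feature_map[row][col] -> feature vector.
    Assumption: closeness is measured with an L1 distance over a square window.
    """
    height, width = len(feature_map), len(feature_map[0])
    hy, hx = handle
    best, best_dist = handle, float("inf")
    for y in range(max(0, hy - radius), min(height, hy + radius + 1)):
        for x in range(max(0, hx - radius), min(width, hx + radius + 1)):
            dist = sum(abs(a - b) for a, b in zip(feature_map[y][x], ref_feature))
            if dist < best_dist:
                best, best_dist = (y, x), dist
    return best

# 3x3 map of one-dimensional features; the handle's original feature was 5.0.
fmap = [[[0.0], [1.0], [2.0]],
        [[3.0], [4.0], [5.0]],
        [[6.0], [7.0], [8.0]]]
new_handle = track_point(fmap, [5.0], handle=(1, 1), radius=1)
```

Here the handle starts at the center of the grid and is relocated to the neighboring cell holding the matching feature, which is the kind of bookkeeping that lets the next motion supervision step start from the content's current position.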