InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing

Haoran Yu, Yi Shi

TL;DR

InstructUDrag addresses the complementary limitations of text-based image editing and object dragging by introducing a two-branch diffusion-based framework that jointly performs object relocation and text-guided edits. A moving-reconstruction branch preserves structure during dragging, while a text-driven editing branch applies semantic changes; both branches share gradient guidance to ensure coherent outputs, aided by non-target position mask learning and DDPM inversion with noise priors to maintain object integrity. The approach is training-free and extendable to object pasting, demonstrating superior object relocation fidelity, reduced trace remnants, and flexible semantic control compared with prior methods. Together, these elements enable high-fidelity, interactive image edits that simultaneously adjust position and appearance with precise user prompts, suitable for practical editing tasks in interactive pipelines.

Abstract

Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
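As a rough illustration of the shared gradient guidance described above, the following toy sketch applies one energy gradient to two latents at once. It is not the authors' implementation: the 1-D "latent", the quadratic energy, and the names `energy_and_grad`, `guided_step`, and `run_shared_guidance` are all assumptions made for illustration, and the sketch omits the diffusion model, the cross-attention refinement, and any erasing of the source location.

```python
# Toy sketch (hypothetical, not the paper's code): a quadratic energy pulls a
# target window of the reconstruction latent toward the object's values, and
# the same gradient is applied to a second "editing" latent.

def energy_and_grad(latent, obj_values, target_start):
    """Squared error between the target window and the object values, plus its gradient."""
    grad = [0.0] * len(latent)
    energy = 0.0
    for k, ref in enumerate(obj_values):
        diff = latent[target_start + k] - ref
        energy += diff * diff
        grad[target_start + k] = 2.0 * diff  # d/dz of (z - ref)^2
    return energy, grad


def guided_step(latent, grad, step_size):
    """One gradient-descent step on the energy."""
    return [z - step_size * g for z, g in zip(latent, grad)]


def run_shared_guidance(recon, edit, obj_values, target_start, steps=20, step_size=0.25):
    """Compute the gradient on the reconstruction latent and share it with the editing latent."""
    for _ in range(steps):
        _, grad = energy_and_grad(recon, obj_values, target_start)
        recon = guided_step(recon, grad, step_size)
        edit = guided_step(edit, grad, step_size)  # same gradient, different latent
    return recon, edit


if __name__ == "__main__":
    recon0 = [1.0, 2.0, 0.0, 0.0, 0.0]   # object occupies positions 0..1
    edit0 = [1.5, 2.5, 0.5, 0.5, 0.5]    # same layout with a different "appearance"
    recon, edit = run_shared_guidance(recon0, edit0, [1.0, 2.0], target_start=3)
    print([round(v, 3) for v in recon])
    print([round(v, 3) for v in edit])
```

In this toy setting both latents receive the same displacement in the target window, so the editing latent follows the relocation while keeping its own offset. That is only meant to suggest why sharing one gradient signal can keep the two branches consistent.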

Paper Structure

This paper contains 19 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Our method allows users to specify a moving direction (or pasting position) for an object along with corresponding prompt words, enabling text-driven editing of images while moving (or pasting).
  • Figure 2: Pipeline of the proposed InstructUDrag. Our framework performs text-guided editing and object dragging simultaneously through two branches. The first branch, referred to as the moving-reconstruction branch, handles object relocation while preserving structural features. The second branch, called the text-driven editing branch, performs text-guided image editing during object dragging, enabling synchronized semantic manipulation during dragging. Meanwhile, our method can flexibly adopt either cross-attention control or mutual self-attention control to enable more diverse text-driven image editing.
  • Figure 3: Effectiveness of the non-target position mask loss $\mathcal{L}_{npm}$. (a) is the input image, and (d) is the mask $\mathbf{M}_c$ of the target position. (b) and (c) are the attention maps corresponding to the object's prompt with/without $\mathcal{L}_{npm}$, (e) and (f) are the images output by the moving-reconstruction branch with/without $\mathcal{L}_{npm}$.
  • Figure 4: First row: text-driven editing branch using cross-attention control, with/without gradient guidance sharing. Second row: text-driven editing branch using mutual self-attention control, with/without gradient guidance sharing.
  • Figure 5: Qualitative comparison on object dragging. The red arrow indicates the direction of the object dragging. The source and target locations are denoted by blue and green points, respectively.
  • ...and 3 more figures
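The Figure 3 caption refers to a non-target position mask loss $\mathcal{L}_{npm}$ and a target-position mask $\mathbf{M}_c$, but this page does not give the loss formula. The sketch below shows one plausible form, assumed purely for illustration: the fraction of the object token's attention that falls outside $\mathbf{M}_c$. The function name `non_target_mask_loss` and the normalization are hypothetical and may differ from the paper's definition.

```python
# Hypothetical form of a non-target position mask loss: the share of
# attention mass lying outside the target mask. Not taken from the paper.

def non_target_mask_loss(attn, target_mask):
    """attn: 2-D list of non-negative attention weights for the object token.
    target_mask: 2-D list of 0/1 values, 1 inside the target position (M_c)."""
    total = 0.0
    outside = 0.0
    for attn_row, mask_row in zip(attn, target_mask):
        for a, m in zip(attn_row, mask_row):
            total += a
            outside += a * (1 - m)  # count only attention outside the mask
    return outside / total if total > 0 else 0.0


if __name__ == "__main__":
    mask = [[0, 0], [0, 1]]
    focused = [[0.0, 0.0], [0.0, 1.0]]   # all attention inside the target
    leaking = [[0.5, 0.0], [0.0, 0.5]]   # half the attention at a non-target position
    print(non_target_mask_loss(focused, mask))
    print(non_target_mask_loss(leaking, mask))
```

Under this assumed form, minimizing the loss pushes the object's attention toward the target position, which matches the with/without comparison the caption describes.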