ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention

Huiguo He, Pengyu Yan, Ziqi Yi, Weizhi Zhong, Zheng Liu, Yejun Tang, Huan Yang, Kun Gai, Guanbin Li, Lianwen Jin

TL;DR

ContextDrag tackles drag-based image editing by injecting VAE-encoded reference features directly into attention through Context-preserving Token Injection (CTI) and Latent-space Reverse Mapping (LRM), avoiding finetuning and inversion. It introduces Position-Consistent Attention, which combines positional re-encoding (PRE) and overlap-aware masking (OAM) to align positional cues and suppress interference, yielding high-fidelity, texture-preserving edits. Results on DragBench-SR and DragBench-DR show state-of-the-art editing accuracy (MD), strong texture fidelity (IF), and superior concept preservation and prompt following (CP and PF), with the best overall CP·PF score. The approach offers a practical, tuning-free way to improve texture coherence and semantic consistency during drag edits; the paper also notes limitations on highly complex deformations.

Abstract

Drag-based image editing aims to modify visual content according to user-specified drag operations. Although existing methods have made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, which leads to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can exploit rich contextual cues and preserve fine-grained details without finetuning or inversion. First, ContextDrag introduces a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which re-encodes the positions of the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
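
The abstract does not spell out how Latent-space Reverse Mapping works, so the following is only a minimal sketch of the general idea of reverse (backward) mapping on a token grid. It assumes the drag is a pure integer translation and that tokens live on an H x W latent grid; the function name `reverse_map_inject`, its parameters, and the translation-only model are illustrative assumptions, not the authors' algorithm. For every destination cell, the sketch looks up the source cell it would have come from and, if that source lies inside the dragged region, copies the noise-free reference token there.

```python
import numpy as np

def reverse_map_inject(ref_tokens, target_tokens, region_mask, shift):
    """Inject reference tokens into shifted destination cells (toy sketch).

    ref_tokens, target_tokens: arrays of shape (H, W, C)
    region_mask: boolean array of shape (H, W), True on the dragged content
    shift: (dy, dx) integer displacement in latent cells
           (assumption: the drag is modeled as a pure translation)
    """
    H, W, _ = ref_tokens.shape
    dy, dx = shift
    out = target_tokens.copy()
    injected = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            # Reverse mapping: ask which source cell lands on (y, x).
            sy, sx = y - dy, x - dx
            if 0 <= sy < H and 0 <= sx < W and region_mask[sy, sx]:
                out[y, x] = ref_tokens[sy, sx]
                injected[y, x] = True
    return out, injected

# Toy usage: one dragged token at (1, 1), moved by (dy, dx) = (1, 2).
ref = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
noisy = np.full((4, 4, 1), -1.0, dtype=np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True
out, injected = reverse_map_inject(ref, noisy, mask, shift=(1, 2))
```

Iterating over destinations rather than sources guarantees that each destination cell receives at most one reference token, which is the usual reason for preferring backward over forward warping.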

Paper Structure

This paper contains 36 sections, 9 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Illustration of our ContextDrag framework. The Drag-Guided Editing Framework injects noise-free reference tokens into the correct target tokens identified by Latent-Space Reverse Mapping, preserving contextual and fine-grained details. Position-Consistent Attention mitigates interference from reference features through two key mechanisms: (1) it re-encodes the RoPE positional embeddings of reference tokens so that corresponding reference and target tokens share the same corrected position and therefore receive higher attention scores; (2) it applies an Overlap-Aware Attention Mask to suppress irrelevant signals arising from overlapping regions.
  • Figure 2: Qualitative comparisons of drag editing between our ContextDrag and other SOTA methods. Our approach achieves the most faithful and consistent edits by effectively leveraging both contextual information and fine-grained texture details. The target points for dragging are marked with blue dots. Additional results are provided in the supplementary materials. Best viewed with zoom-in.
  • Figure 3: Qualitative comparisons of drag editing between our warping strategy and other warping methods. Our warping approach achieves superior editing accuracy and semantic consistency. Best viewed with zoom-in.
  • Figure 4: Qualitative comparisons of ablation study. Our full model accurately moves the object to the target destination while perfectly preserving its original appearance. Best viewed with zoom-in.
  • Figure 5: Failure cases of our ContextDrag and competing approaches. Like existing methods, our approach is unable to handle complex deformations, such as flipping. Best viewed with zoom-in.
  • ...and 10 more figures
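
The two Position-Consistent Attention mechanisms named in the Figure 1 caption can be illustrated with a toy sketch. It rests on assumptions that this page does not confirm: the drag is a pure integer translation, position ids are plain (row, column) pairs that a RoPE layer would consume, and "overlap" is read as an unmoved reference token sitting in a cell that a moved token now claims. The function `reencode_and_mask` is hypothetical and is not the authors' implementation.

```python
import numpy as np

def reencode_and_mask(region_mask, shift):
    """Return re-encoded position ids and a keep-mask for reference tokens.

    region_mask: boolean (H, W), True on the dragged content
    shift: (dy, dx) integer displacement (assumption: pure translation)
    Returns:
      new_pos: (H*W, 2) position ids, moved tokens take destination ids
      keep:    (H*W,) boolean, False for reference tokens to be masked out
    """
    H, W = region_mask.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pos = np.stack([ys, xs], axis=-1)              # original ids, (H, W, 2)
    shifted = pos + np.asarray(shift)              # candidate destination ids
    in_bounds = ((shifted[..., 0] >= 0) & (shifted[..., 0] < H)
                 & (shifted[..., 1] >= 0) & (shifted[..., 1] < W))
    moved = region_mask & in_bounds                # tokens that stay on the grid
    # (1) Re-encoding: a moved reference token shares its destination's id.
    new_pos = np.where(moved[..., None], shifted, pos)
    # (2) Overlap-aware mask (one possible reading): hide unmoved reference
    # tokens whose cell is now claimed by a moved token.
    dest = np.zeros((H, W), dtype=bool)
    dest[shifted[moved][:, 0], shifted[moved][:, 1]] = True
    keep = ~(dest & ~moved)
    return new_pos.reshape(-1, 2), keep.reshape(-1)

# Toy usage on a 4 x 4 grid with one dragged token at (1, 1).
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True
new_pos, keep = reencode_and_mask(mask, shift=(1, 2))
```

In an attention layer, `keep` would typically be broadcast over the query dimension so that masked reference keys receive no attention weight, while `new_pos` would replace the reference tokens' original position ids before the rotary embedding is applied.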