ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention
Huiguo He, Pengyu Yan, Ziqi Yi, Weizhi Zhong, Zheng Liu, Yejun Tang, Huan Yang, Kun Gai, Guanbin Li, Lianwen Jin
TL;DR
ContextDrag tackles drag-based image editing by injecting VAE-encoded reference features directly into attention through Context-preserving Token Injection (CTI) and Latent-space Reverse Mapping (LRM), avoiding finetuning or inversion. It introduces Position-Consistent Attention (PCA), which uses positional re-encoding (PRE) and overlap-aware masking (OAM) to align positional cues and suppress interference, yielding high-fidelity, texture-preserving edits. Empirical results on DragBench-SR and DragBench-DR show state-of-the-art editing accuracy (MD), strong texture fidelity (IF), and superior concept preservation and prompt following (CP and PF), with an overall CP·PF lead. The approach offers a practical, tuning-free solution that improves texture coherence and semantic consistency during drag edits; the paper also outlines limitations on highly complex deformations.
Abstract
Drag-based image editing aims to modify visual content according to user-specified drag operations. Although existing methods have made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, which leads to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can exploit rich contextual cues and preserve fine-grained details without finetuning or inversion. First, ContextDrag introduces a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which re-encodes the positions of the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.
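
The idea of placing reference features at their destination locations by reverse mapping can be illustrated with a minimal sketch. This is not the authors' LRM implementation: it assumes a single rigid integer translation on a latent grid, and the function name `reverse_map_inject` and its arguments are hypothetical. Each destination cell looks up the source cell it would have come from, and the reference token is copied only if that source cell lies inside the dragged region.

```python
import numpy as np

def reverse_map_inject(ref_latent, region_mask, drag):
    """Illustrative reverse mapping on a latent grid (assumed setup, not the paper's LRM).

    ref_latent:  (H, W, C) array of noise-free reference tokens.
    region_mask: (H, W) bool array marking the source region being dragged.
    drag:        (dy, dx) integer displacement in latent cells.
    Returns the edited latent and a bool map of destination cells that were filled.
    """
    H, W, _ = ref_latent.shape
    dy, dx = drag
    ys, xs = np.mgrid[0:H, 0:W]

    # For every destination cell, compute the source cell it maps back to.
    src_y, src_x = ys - dy, xs - dx
    valid = (src_y >= 0) & (src_y < H) & (src_x >= 0) & (src_x < W)

    # A destination cell is filled only if its source lies in the dragged region.
    hit = np.zeros((H, W), dtype=bool)
    hit[valid] = region_mask[src_y[valid], src_x[valid]]

    out = ref_latent.copy()
    out[hit] = ref_latent[src_y[hit], src_x[hit]]
    return out, hit

# Toy usage: move the token at (1, 1) by (+1, +2) on a 4x4 grid.
ref = np.arange(16, dtype=float).reshape(4, 4, 1)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True
edited, injected = reverse_map_inject(ref, mask, (1, 2))
```

Iterating over destination cells rather than source cells guarantees that every filled destination receives exactly one token, which is the usual reason to prefer a reverse (backward) mapping over a forward one.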
