DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing

Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong

TL;DR

DragFlow tackles drag-based image editing with DiT priors by introducing region-level affine supervision, hard background constraints, and adapter-enhanced inversion to leverage FLUX's stronger priors. It demonstrates that point-based drag fails on DiTs due to fine-grained feature geometry and inversion drift, motivating region-based supervision. The framework also uses MLLM-driven intent resolution and introduces ReD Bench for region-aware evaluation, showing state-of-the-art performance on DragBench-DR and ReD Bench. The work enables more faithful, controllable edits on complex, detail-rich images with strong subject fidelity.

Abstract

Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models (Stable Diffusion) are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiTs with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes DragFlow, the first framework to effectively harness FLUX's rich prior for drag-based editing, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state of the art in drag-based image editing. Code and datasets will be publicly available upon publication.
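The abstract names two mechanisms without giving their formulation: affine transformations of a region, and gradient masks that act as hard background constraints. The toy NumPy sketch below illustrates only the general idea and is not the paper's implementation. The function names (`affine_warp_mask`, `masked_update`), the nearest-neighbor warping, and the plain gradient step are all assumptions made for illustration; the actual method operates on DiT features and latents inside FLUX.

```python
import numpy as np

def affine_warp_mask(mask, A, t):
    """Warp a 2D region mask by x' = A @ x + t (hypothetical helper).

    Uses inverse mapping with nearest-neighbor lookup: every target pixel
    reads the source pixel it came from. Coordinates are (x, y).
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel()], axis=0).astype(float)  # (2, h*w)
    src = np.linalg.inv(A) @ (dst - t[:, None])
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    # Target pixels whose source falls outside the image stay zero.
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=mask.dtype)
    out[valid] = mask[sy[valid], sx[valid]]
    return out.reshape(h, w)

def masked_update(latent, grad, edit_mask, lr=0.1):
    """One gradient step that leaves positions outside edit_mask untouched.

    Multiplying the gradient by a 0/1 mask is a hard constraint: background
    entries receive exactly zero update (assumed form, for illustration).
    """
    return latent - lr * grad * edit_mask

# Usage: drag a 2x2 region three pixels to the right on an 8x8 grid.
source_mask = np.zeros((8, 8))
source_mask[2:4, 1:3] = 1.0
target_mask = affine_warp_mask(source_mask, np.eye(2), np.array([3.0, 0.0]))

# The editable area covers both where the region was and where it lands.
edit_mask = np.clip(source_mask + target_mask, 0.0, 1.0)
latent = np.ones((2, 8, 8))           # toy latent with 2 channels
grad = np.ones_like(latent)           # stand-in for a real loss gradient
updated = masked_update(latent, grad, edit_mask, lr=0.5)
```

Because the mask multiplies the gradient rather than adding a penalty term to the loss, background positions cannot drift regardless of the step size, which is what distinguishes a hard constraint from a soft one.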

Paper Structure

This paper contains 60 sections, 26 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Comparison of drag-editing results between baselines and our method, DragFlow. DragFlow successfully unleashes FLUX’s stronger generative prior, removing the distortions that previous methods produced in challenging scenarios.
  • Figure 2: Comparison of feature maps extracted from UNet and DiT at the same denoising step. UNet produces spatially compact, highly compressed features that capture high-level semantic information, whereas DiT generates finer-grained, spatially precise representations.
  • Figure 3: Overview of the DragFlow framework. The original image is inverted into a noisy latent space and iteratively optimized under the proposed region-level affine supervision. Subject consistency is reinforced through key-value (KV) injection and our adapter-enhanced inversion, while background fidelity is maintained via gradient mask-based hard constraints. In addition, a multimodal large language model (MLLM) is employed to better interpret and clarify user intents.
  • Figure 4: Visualization of the effect of adapter-enhanced inversion on subject consistency, compared with KV injection alone.
  • Figure 5: Qualitative comparison of our method with multiple baselines in challenging scenarios.
  • ...and 8 more figures