FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing

Kaixiang Yang, Boyang Shen, Xin Li, Yuchen Dai, Yuxuan Luo, Yueran Ma, Wei Fang, Qiang Li, Zhiwei Wang

TL;DR

FIA-Edit targets efficient, high-fidelity text-guided image editing in an inversion-free setting by explicitly modeling source–target interactions. It introduces Frequency-Interactive Attention, built from two modules: Frequency Representation Interaction (FRI), which fuses source and target frequency components in self-attention, and Feature Injection (FIJ), which injects source features into the target branch's cross-attention. Together they shape the velocity-field update $v^{\Delta}_t$. Built on Rectified Flow, FIA-Edit edits a 512×512 image in about 6 s on an RTX 4090 and reports state-of-the-art background preservation and semantic control on PIE-Bench. The authors also present what they describe as the first application of text-guided editing to medical images: synthesizing anatomically coherent hemorrhage variations in surgical images as data augmentation, which improves downstream bleeding classification. The code is publicly available at https://github.com/kk42yy/FIA-Edit.

Abstract

Text-guided image editing has advanced rapidly with the rise of diffusion models. While flow-based inversion-free methods offer high efficiency by avoiding latent inversion, they often fail to integrate source information effectively, leading to poor background preservation, spatial inconsistencies, and over-editing. In this paper, we present FIA-Edit, a novel inversion-free framework that achieves high-fidelity and semantically precise edits through Frequency-Interactive Attention. Specifically, we design two key components: (1) a Frequency Representation Interaction (FRI) module that enhances cross-domain alignment by exchanging frequency components between source and target features within self-attention, and (2) a Feature Injection (FIJ) module that explicitly incorporates source-side queries, keys, values, and text embeddings into the target branch's cross-attention to preserve structure and semantics. Extensive experiments demonstrate that FIA-Edit supports high-fidelity editing at low computational cost (~6 s per 512×512 image on an RTX 4090) and consistently outperforms existing methods across diverse tasks in visual quality, background fidelity, and controllability. Furthermore, we are the first to extend text-guided image editing to clinical applications. By synthesizing anatomically coherent hemorrhage variations in surgical images, FIA-Edit opens new opportunities for medical data augmentation and delivers significant gains in downstream bleeding classification. Our project is available at: https://github.com/kk42yy/FIA-Edit.
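To make the idea of "exchanging frequency components between source and target features" concrete, here is a minimal 1-D sketch in plain Python. It is not the paper's FRI module: the function names, the naive DFT, and in particular the rule "low-frequency bins from the source, remaining bins from the target" with a hard `cutoff` are assumptions chosen only for illustration. The abstract does not specify which bands are exchanged or how they are fused.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n
            for k in range(n)]

def exchange_frequencies(src, tar, cutoff):
    """Hypothetical fusion rule (an assumption, not the paper's design):
    keep bins within `cutoff` of the DC bin from `src`, all others from `tar`."""
    n = len(src)
    src_spec, tar_spec = dft(src), dft(tar)
    fused = []
    for f in range(n):
        # Bins f and n - f carry the same frequency, so select them together;
        # this keeps the fused spectrum conjugate-symmetric for real inputs.
        dist_from_dc = min(f, n - f)
        fused.append(src_spec[f] if dist_from_dc <= cutoff else tar_spec[f])
    # The imaginary parts are only rounding noise for real inputs.
    return [value.real for value in idft(fused)]

# Toy "features": a smooth source signal and a rapidly alternating target.
source_feat = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
target_feat = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
fused_feat = exchange_frequencies(source_feat, target_feat, cutoff=1)
```

With `cutoff=0` only the mean (DC component) comes from the source; raising `cutoff` hands progressively more of the coarse structure to the source while the target keeps the fine detail. In an attention layer the same operation would be applied to feature tensors rather than a short list.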

Paper Structure

This paper contains 37 sections, 15 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: FIA-Edit is capable of handling a wide range of image editing tasks, including object modification, addition and removal, color transformation, and text replacement.
  • Figure 2: Overview of inversion-based and inversion-free image editing methods. (a) Inversion-based methods first invert the source image to noise, then edit from noise using the target prompt, often injecting source features during denoising. (b) Inversion-free methods bypass inversion by estimating two velocity fields, one from the noisy latent to the source and one from the noisy latent to the target. Their difference defines the editing direction from source to target. However, this does not guarantee that the result preserves both background and target semantics (dark red ellipse). (c) Our method incorporates source-aware constraints (i.e., the FIA constraint) during the computation of the noisy-latent-to-target velocity field $v^{tar}_t$, effectively guiding the editing trajectory toward regions that preserve background fidelity while achieving semantic accuracy. Dashed arrows in (c) indicate FlowEdit, which lacks this guidance and fails to reach such an optimal region.
  • Figure 3: Details of our framework. (a) Overview of FIA-Edit. During the computation of source and target velocity fields, we introduce the FIA constraint to enable interaction between source and target features. (b) FIA constraint. (c) Frequency Representation Interaction (FRI). FRI is integrated into the self-attention layers. Both source and target Q/K features are fused in the frequency domain, and the fused output replaces the target Q/K. The right side shows the detailed structure of the frequency-domain fusion module $fri$. (d) Feature Injection (FIJ). FIJ is used in the cross-attention layers in the latter part of the DiT.
  • Figure 4: Qualitative comparison. Our method preserves the background while accurately reflecting the target semantics. White circles highlight cases where other methods poorly preserve non-editing regions.
  • Figure 5: Qualitative comparison of editing results with and without the proposed FIA constraint. Each unit shows a source image (left), an edited result without FIA (middle, w/o FIA), and an edited result with FIA (right, w/ FIA). The FIA constraint significantly improves background preservation and enhances semantic accuracy in the edited regions.
  • ...and 6 more figures
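
The inversion-free scheme described in the Figure 2(b) caption, where the difference of two velocity fields gives the editing direction, can be sketched with a toy example. Everything here is an assumption for illustration: `toy_velocity` is a hypothetical oracle standing in for the real model, the update loosely follows a FlowEdit-style Euler loop, and the FIA constraint (which would act inside the target-velocity computation) is not modeled.

```python
import random

def toy_velocity(z_t, t, clean):
    """Hypothetical oracle, not a real model. On the rectified-flow path
    z_t = (1 - t) * clean + t * noise the velocity is noise - clean,
    which equals (z_t - clean) / t."""
    return [(z - c) / t for z, c in zip(z_t, clean)]

def inversion_free_edit(x_src, clean_src, clean_tar, steps=10, seed=0):
    """Integrate the velocity difference from t = 1 down to t = 0,
    starting from the source image itself (no inversion to noise)."""
    rng = random.Random(seed)
    z_edit = list(x_src)
    times = [1.0 - i / steps for i in range(steps + 1)]  # 1.0, ..., 0.0
    for t, t_next in zip(times[:-1], times[1:]):
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        # Noisy latent on the source side.
        z_src = [(1 - t) * x + t * n for x, n in zip(x_src, noise)]
        # Noisy latent on the target side shares the same noise offset.
        z_tar = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_src = toy_velocity(z_src, t, clean_src)
        v_tar = toy_velocity(z_tar, t, clean_tar)  # source-aware guidance would enter here
        v_delta = [a - b for a, b in zip(v_tar, v_src)]  # editing direction
        # Euler step; t_next - t is negative because time runs toward 0.
        z_edit = [ze + (t_next - t) * vd for ze, vd in zip(z_edit, v_delta)]
    return z_edit

# Toy 2-"pixel" image: the first value is background, the second is edited.
source = [1.0, 2.0]
edited = inversion_free_edit(source, clean_src=source, clean_tar=[1.0, 5.0])
```

In this toy setting the oracle knows the clean target exactly, so the trajectory lands on it and the unchanged first value is preserved. A real model gives no such guarantee, which is the gap the caption attributes to methods without source-aware guidance.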