FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing

Gwanhyeong Koo, Sunjae Yoon, Ji Woo Hong, Chang D. Yoo

TL;DR

This work identifies that the high-frequency components of the DDIM latent $z_T$ encode layout information, which hampers flexible non-rigid editing. It introduces FlexiEdit, a frequency-aware editing framework built on Latent Refinement and a three-branch Re-inversion architecture, which reduces the high-frequency content of $z_T$ in edited regions and reintegrates original attributes during retargeting. The method uses an editing mask $M$, a frequency parameter $\alpha$, and a re-inversion duration $t_R$ (with $\alpha_R=\beta_R=0.5$) to balance layout changes against fidelity. It is validated on PIE-Bench and ELITE against multiple baselines using six quantitative metrics. Results show enhanced non-rigid editing flexibility and competitive rigid editing performance, offering a principled approach to frequency-aware latent refinement in diffusion-based image editing. Overall, FlexiEdit addresses latent-frequency constraints in text-guided image editing, enabling more natural, layout-aware modifications.

Abstract

Current image editing methods primarily use DDIM Inversion, employing a two-branch diffusion approach to preserve the attributes and layout of the original image. However, these methods struggle with non-rigid edits, which involve altering the image's layout or structure. Our comprehensive analysis reveals that the high-frequency components of the DDIM latent, crucial for retaining the original image's key features and layout, significantly contribute to these limitations. To address this, we introduce FlexiEdit, which enhances fidelity to input text prompts by refining the DDIM latent, reducing its high-frequency components in targeted editing areas. FlexiEdit comprises two key components: (1) Latent Refinement, which modifies the DDIM latent to better accommodate layout adjustments, and (2) Edit Fidelity Enhancement via Re-inversion, which aims to ensure that the edits more accurately reflect the input text prompts. Our approach represents notable progress in image editing, particularly for complex non-rigid edits, as demonstrated through comparative experiments.
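To make the notion of "high-frequency components of the DDIM latent" concrete, the following is a minimal sketch of one way to split a latent into low- and high-frequency parts. It is not taken from the paper: the ideal radial low-pass filter, the `cutoff` value, and the function names `radial_lowpass_mask` and `split_frequencies` are all assumptions made for illustration.

```python
import numpy as np

def radial_lowpass_mask(h, w, cutoff):
    """Boolean (h, w) mask keeping frequencies whose radius is <= cutoff.

    Frequencies are in cycles per sample, so each axis spans [-0.5, 0.5).
    The ideal (hard-edged) filter shape is an assumption for illustration.
    """
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx**2 + fy**2) <= cutoff

def split_frequencies(z, cutoff=0.25):
    """Split a (C, H, W) latent into low- and high-frequency parts.

    `cutoff` is a hypothetical parameter, not a value from the paper.
    """
    keep = radial_lowpass_mask(z.shape[-2], z.shape[-1], cutoff)
    spectrum = np.fft.fft2(z, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real
    high = z - low  # the two parts sum back to z by construction
    return low, high
```

Because the high-frequency part is defined as the residual, the two parts always sum back to the input, which keeps any later rescaling of the high-frequency part easy to reason about.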

Paper Structure

This paper contains 31 sections, 12 equations, 13 figures, and 2 tables.

Figures (13)

  • Figure 1: Comparative editing results using FlexiEdit (ours), MasaCtrl [masactrl], and Prompt-to-Prompt (P2P) [prompt_to_prompt]. FlexiEdit outperforms other methods in non-rigid edits by providing more flexibility in altering layouts and achieving more natural results in rigid edits.
  • Figure 2: (a) Comparison of non-rigid edit outcomes between MasaCtrl [masactrl] and FlexiEdit, showing FlexiEdit's enhanced flexibility. (b) A schematic of Latent Refinement in FlexiEdit, illustrating the reduction of high-frequency components in the original latent for improved non-rigid editing. (c) Comparative CLIP similarity scores for P2P [prompt_to_prompt], MasaCtrl [masactrl], and FlexiEdit in rigid and non-rigid edits on the PIE benchmark [proxedit].
  • Figure 3: (a) and (b) show the PSNR and LPIPS results of reconstructing from $z^{H, \alpha}_T$ and $z^{L, \alpha}_T$, measured against the original image. (c) Visualizes the reconstruction outcome across different $\alpha$ values, indicating that high-frequency components play a more significant role in forming the object's layout than low-frequency components.
  • Figure 4: The pipeline of FlexiEdit. (a) Our method uses the refined latent $z'_{T}$ to obtain $I_{mid}$, which significantly alters the original image's layout. Following re-inversion over a duration of $t_R$, features from the original image are injected during the resampling process, resulting in the final edited image, $I_{tar}$. (b) Within the edited region of the latent, the refinement process reduces high-frequency components by a factor of $\alpha$ while incorporating Gaussian noise proportional to $(1 - \alpha)$.
  • Figure 5: Results of varying $\alpha$ when the latent is refined within the user mask $M$, shown as $I_{recon}$, $I_{mid}$, and $I_{tar}$. As $\alpha$ decreases, the layout deviates more from that of the original image; higher $\alpha$ values yield a layout that closely aligns with the original image.
  • ...and 8 more figures
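
The refinement described in the Figure 4(b) caption can be sketched as follows. This is only one plausible reading of the caption, not the authors' implementation: inside the mask $M$ the high-frequency part of $z_T$ is scaled by $\alpha$, and Gaussian noise weighted by $(1 - \alpha)$ is added. Restricting the added noise to its high-frequency part, the ideal low-pass filter, the `cutoff` value, and the names `lowpass` and `refine_latent` are all assumptions.

```python
import numpy as np

def lowpass(x, cutoff):
    """Ideal radial low-pass filter over the last two axes (assumed filter)."""
    h, w = x.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    keep = np.sqrt(fx**2 + fy**2) <= cutoff
    spectrum = np.fft.fft2(x, axes=(-2, -1))
    return np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real

def refine_latent(z_T, mask, alpha, cutoff=0.25, rng=None):
    """Hypothetical masked latent refinement.

    z_T:   (C, H, W) DDIM latent
    mask:  (H, W) array, 1 inside the edit region and 0 elsewhere
    alpha: weight in [0, 1] on the original high-frequency content
    """
    rng = np.random.default_rng() if rng is None else rng
    z_low = lowpass(z_T, cutoff)
    z_high = z_T - z_low
    eps = rng.standard_normal(z_T.shape)
    eps_high = eps - lowpass(eps, cutoff)  # assumption: noise fills high band only
    refined = z_low + alpha * z_high + (1.0 - alpha) * eps_high
    # Outside the mask the latent is left exactly as it was.
    return mask * refined + (1.0 - mask) * z_T
```

Under this sketch, `alpha=1.0` returns the original latent up to floating-point error, while smaller values replace progressively more of the masked region's high-frequency content with noise, matching the trend described for Figure 5.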