FlexiEdit: Frequency-Aware Latent Refinement for Enhanced Non-Rigid Editing
Gwanhyeong Koo, Sunjae Yoon, Ji Woo Hong, Chang D. Yoo
TL;DR
This work identifies that the high-frequency components of the DDIM latent $z_T$ encode layout information, which hampers flexible non-rigid editing. It introduces FlexiEdit, a frequency-aware editing framework built on Latent Refinement and a three-branch Re-inversion architecture, which reduces the high-frequency content of $z_T$ in edited regions and reintegrates original attributes during retargeting. The method uses an editing mask $M$, a frequency parameter $\alpha$, and a re-inversion duration $t_R$ (with $\alpha_R=\beta_R=0.5$) to balance layout changes against fidelity. It is validated on PIE-Bench and ELITE against multiple baselines with six quantitative metrics. Results show enhanced non-rigid editing flexibility and competitive rigid editing performance. By addressing the constraints that latent frequencies place on layout, FlexiEdit enables more natural, layout-aware modifications in text-guided, diffusion-based image editing.
Abstract
Current image editing methods primarily rely on DDIM Inversion, employing a two-branch diffusion approach to preserve the attributes and layout of the original image. However, these methods struggle with non-rigid edits, which involve altering the image's layout or structure. Our comprehensive analysis reveals that the high-frequency components of the DDIM latent, which are crucial for retaining the original image's key features and layout, contribute significantly to these limitations. To address this, we introduce FlexiEdit, which enhances fidelity to input text prompts by refining the DDIM latent, specifically by reducing its high-frequency components in the targeted editing areas. FlexiEdit comprises two key components: (1) Latent Refinement, which modifies the DDIM latent to better accommodate layout adjustments, and (2) Edit Fidelity Enhancement via Re-inversion, which ensures that the edits more accurately reflect the input text prompts. Comparative experiments show that our approach improves image editing, particularly for complex non-rigid edits.
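The idea of reducing high-frequency content of the DDIM latent only inside the edited region can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the box-shaped low-pass filter, the reading of $\alpha$ as a cutoff fraction of the Nyquist frequency, the linear blend with the mask $M$, and the function names `low_pass` and `refine_latent` are all assumptions made for illustration.

```python
import numpy as np


def low_pass(latent: np.ndarray, alpha: float) -> np.ndarray:
    """Zero out spatial frequencies above alpha * Nyquist on each axis.

    latent: array of shape (channels, height, width).
    alpha:  assumed cutoff fraction in [0, 1]; 1 keeps every frequency,
            0 keeps only the per-channel mean (hypothetical interpretation).
    """
    _, h, w = latent.shape
    # Absolute frequency of each row/column bin, in cycles per sample (0 .. 0.5).
    fy = np.abs(np.fft.fftfreq(h))[:, None]
    fx = np.abs(np.fft.fftfreq(w))[None, :]
    keep = (fy <= 0.5 * alpha) & (fx <= 0.5 * alpha)
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    # The kept band is symmetric in frequency, so the inverse is real up to rounding.
    return np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real


def refine_latent(z_T: np.ndarray, mask: np.ndarray, alpha: float) -> np.ndarray:
    """Use the low-passed latent inside the mask and the original latent outside.

    mask: array of shape (height, width) with values in [0, 1]; 1 marks the
          region to edit. The linear blend is an assumption of this sketch.
    """
    z_low = low_pass(z_T, alpha)
    return mask * z_low + (1.0 - mask) * z_T


# Illustrative usage on a random stand-in for a 4-channel 64x64 latent.
rng = np.random.default_rng(0)
z_T = rng.standard_normal((4, 64, 64))
M = np.zeros((64, 64))
M[16:48, 16:48] = 1.0  # hypothetical editing region
z_refined = refine_latent(z_T, M, alpha=0.3)
```

Outside the mask the latent is left untouched, so the original layout cues survive there, while inside the mask only the low-frequency band remains. Low-pass filtering also lowers the variance of the latent in the masked region, which a full pipeline would likely need to account for; this sketch does not.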
