FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models

Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, Antoni B. Chan

TL;DR

This work tackles misalignment in text-guided image editing with diffusion models by introducing FreeDiff, a tuning-free approach that refines guidance through progressive frequency truncation in the Fourier domain. A frequency-perspective analysis reveals that early denoising favors low-frequency components, which motivates restricting the guidance to an effective frequency band during a defined response period. The authors categorize edits into SF-0, SF-1, and SF-2, and provide a practical two-step process for color/environment edits, along with default hyperparameters and an inversion-based editing pipeline. Across PIE-based evaluations, FreeDiff delivers competitive qualitative and quantitative results compared with attention-based methods, while avoiding architecture-level modifications and supporting both rigid and non-rigid editing tasks.

Abstract

Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended editing target regions and the broader area that the guidance affects in practice. Although excellent attention-based methods have been developed to refine the editing guidance, these approaches require complex modifications to the network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and the misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus introduces excessive low-frequency signals into the edit. Leveraging this insight, we introduce a novel fine-tuning-free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves results comparable to state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
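
The frequency argument in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a standard forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with white noise, and an idealized image power spectrum $A / f^{\gamma}$. The function name `snr_cutoff_frequency` and the parameters `amplitude` and `exponent` are hypothetical, chosen only for illustration.

```python
import math


def snr_cutoff_frequency(alpha_bar_t, amplitude=1.0, exponent=2.0):
    """Radial frequency below which the per-frequency SNR exceeds 1.

    Assumptions (illustrative, not from the paper):
      * image power at frequency f is amplitude / f**exponent (power law)
      * noise is white, so its power is flat across frequencies
      * x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    """
    if not 0.0 < alpha_bar_t < 1.0:
        raise ValueError("alpha_bar_t must lie strictly between 0 and 1")
    # SNR(f) = alpha_bar_t * amplitude / f**exponent / (1 - alpha_bar_t).
    # Solving SNR(f) = 1 for f gives the boundary of the recoverable band.
    ratio = alpha_bar_t * amplitude / (1.0 - alpha_bar_t)
    return ratio ** (1.0 / exponent)


# Small alpha_bar_t corresponds to the early (noisiest) denoising steps.
for alpha_bar in (0.01, 0.5, 0.99):
    print(alpha_bar, snr_cutoff_frequency(alpha_bar))
```

Under these assumptions the cutoff grows as $\bar{\alpha}_t$ grows, so only the lowest frequencies sit above the noise floor at the earliest denoising steps, which is consistent with the claim that guidance at those steps is dominated by low-frequency content.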

Paper Structure (24 sections, 13 equations, 14 figures, 3 tables, 1 algorithm)

Figures (14)

  • Figure 1: Editing results across different editing tasks using our proposed method FreeDiff demonstrate the effectiveness of our progressive frequency truncation strategy.
  • Figure 2: Decoded intermediate features and Fourier-transformed features from a generation process with SD v1.5 [Rombach_2022_CVPR], using the prompt "a lovely corgi running on a city street". The first, second, third, and fourth rows show, respectively, the decoded noisy latents $x_t$, the decoded $\tilde{x}_{t:0}$, the guidance $g_t$, and the power spectrum of $x_0$ with the SNR (signal-to-noise ratio) indicator (red box) at the corresponding timestep. The timestep is shown at the bottom. The SNR box marks where the ratio of signal (image) to latent noise is greater than 1, indicating the frequency bands in which the network is more likely to recover $x_0$ from $x_t$. Note that, to reveal lower-frequency components, the same power spectrum is normalized with a lower truncated upper bound as $t$ decreases.
  • Figure 3: Editing results from attention-based refinement methods, P2P [hertz2022prompt] + NTI [mokady2023null] and PNP [Tumanyan_2023_CVPR] + fixed-point inversion in Section \ref{sec:method-freqtrunc}, and from directly applying guidance. Columns d) and e) show $\mathcal{F}_{diff}(I_{src}, I_{edit})$ for the pairs <source image, attention-based editing> and <source image, direct editing>, respectively. $\mathcal{F}_{diff}(I_{src}, I_{edit})$ is normalized to the same numerical scale within each row. The results suggest that direct editing introduces low-frequency components with higher amplitudes.
  • Figure 4: The pipeline of our proposed method. Progressive frequency truncation is applied only in the response period, according to Alg. \ref{Alg:prog}, while guidance outside the response period is set to zero.
  • Figure 5: Qualitative comparison with three typical attention-based editing methods (P2P, PNP, and MasaCtrl) on images from the PIE dataset [ju2023direct]. Direct editing results with fixed-point inversion are also shown as a baseline.
  • ...and 9 more figures
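
The behavior described in the Figure 4 caption (truncated guidance inside the response period, zero guidance outside it) can be sketched roughly as follows. This is a minimal illustration and not the authors' algorithm: it assumes the truncation removes Fourier components below a radial cutoff, and the names `truncate_low_frequencies`, `refined_guidance`, `response_period`, and `linear_cutoff`, as well as the linear schedule and its numeric defaults, are hypothetical.

```python
import numpy as np


def radial_frequency(height, width):
    """Radial frequency (cycles per sample) of every 2D FFT bin."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def truncate_low_frequencies(guidance, cutoff):
    """Zero the Fourier components whose radial frequency is below `cutoff`.

    Works on the last two axes, so (H, W) and (C, H, W) arrays are accepted.
    """
    spectrum = np.fft.fft2(guidance)
    keep = radial_frequency(*guidance.shape[-2:]) >= cutoff
    return np.fft.ifft2(spectrum * keep).real


def refined_guidance(guidance, t, response_period, cutoff_schedule):
    """Truncated guidance inside the response period, zero guidance outside."""
    t_start, t_end = response_period  # timesteps count down from t_start to t_end
    if not (t_end <= t <= t_start):
        return np.zeros_like(guidance)
    return truncate_low_frequencies(guidance, cutoff_schedule(t))


def linear_cutoff(t, t_start=800, t_end=300, max_cutoff=0.25):
    """Hypothetical schedule: the cutoff shrinks linearly as t decreases."""
    return max_cutoff * (t - t_end) / (t_start - t_end)


rng = np.random.default_rng(0)
g_t = rng.standard_normal((4, 64, 64))  # stand-in for a guidance tensor
for t in (900, 800, 500, 300, 100):
    out = refined_guidance(g_t, t, (800, 300), linear_cutoff)
    print(t, float(np.abs(out).mean()))
```

Hard masking in the Fourier domain is the simplest choice for a sketch; the actual truncation rule, the response period, and the schedule should be taken from the paper's algorithm and default hyperparameters.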