FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

Yufan Ren, Zicong Jiang, Tong Zhang, Søren Forchhammer, Sabine Süsstrunk

TL;DR

This work tackles the limitations of text-guided edits in diffusion-based T2I systems, where changes often indiscriminately affect all frequency content. It introduces a frequency-aware denoising score that uses discrete wavelet transforms to decompose latent representations into low- and high-frequency subbands and applies selective optimization during editing. The approach enables accurate 2D image edits and extends to 3D texture editing through a frequency-decomposed triplane representation, with quantitative metrics and user studies showing improved detail preservation and color fidelity. By avoiding diffusion-model retraining and providing fine-grained frequency control, the method offers a practical path to more reliable and controllable image and texture edits.

Abstract

Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications, such as the loss of local detail and color changes. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables the selective optimization of specific frequency bands within localized spatial regions for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications at various levels of detail. To extend the applicability of our approach, we provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, enabling frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits.
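
The selective optimization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-channel 2D latent, a single-level orthonormal Haar wavelet, and a toy gradient `z - target` standing in for the gradient a diffusion model would supply. The names `haar_dwt2`, `haar_idwt2`, and `selective_step` are hypothetical. Because the transform is orthonormal, the gradient with respect to the wavelet coefficients is the transform of the gradient with respect to the latent, so freezing a subband amounts to not applying its part of the update.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform (even height and width assumed)."""
    a = (x[0::2, :] + x[1::2, :]) / SQRT2   # sums of adjacent rows
    d = (x[0::2, :] - x[1::2, :]) / SQRT2   # differences of adjacent rows
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2  # low-frequency subband
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2  # high-frequency subbands
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, (lh, hl, hh)

def haar_idwt2(ll, highs):
    """Exact inverse of haar_dwt2."""
    lh, hl, hh = highs
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return x

def selective_step(z, grad_z, lr, edit="low"):
    """One gradient step applied only to the chosen subbands; the rest stay frozen."""
    ll, highs = haar_dwt2(z)
    g_ll, g_highs = haar_dwt2(grad_z)  # valid because the transform is orthonormal
    if edit == "low":
        ll = ll - lr * g_ll            # update low frequencies, keep details
    else:
        highs = tuple(h - lr * g for h, g in zip(highs, g_highs))  # update details only
    return haar_idwt2(ll, highs)

# Toy usage: the gradient of 0.5 * ||z - target||^2 replaces the diffusion gradient.
rng = np.random.default_rng(0)
z = rng.standard_normal((8, 8))
target = rng.standard_normal((8, 8))
z_edited = selective_step(z, z - target, lr=0.5, edit="low")
```

In this toy setting, `edit="low"` leaves the three high-frequency subbands of `z` exactly unchanged, which mirrors the detail-preservation use case, while `edit="high"` leaves the low-frequency subband unchanged, which mirrors the color-fidelity use case.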

Paper Structure

This paper contains 24 sections, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Text-guided image editing using Text-to-Image (T2I) models, such as DDS (Hertz et al., 2023), often fails to produce satisfactory results due to indiscriminate optimization across all frequency subbands. For example, in the top row, DDS removes the detailed pattern of latte art (a) and drastically alters the cat's color despite "gray" being specified in the prompt (b). These issues become more apparent through frequency decomposition during optimization (L.F.S and H.F.S refer to low-frequency subband and high-frequency subband, respectively) in the second row, where unnecessary modifications occur. Our method selectively optimizes frequency bands, preserving high-frequency details in the latte art (a) and maintaining color consistency in the gray cat (b). The "freeze" symbol indicates frozen frequency components, while the "flame" symbol indicates optimized ones. Best viewed on a screen when zoomed in.
  • Figure 2: Method overview. Ours (a) differs from vanilla score distillation editing (b), which backpropagates gradients to the latent space ($z$) to perform editing. Our method leverages wavelet frequency decomposition to decompose the latent $z$ into wavelet subbands $\phi$, including high-frequency subbands ($\phi_{\text{H.F.S}} = \{\phi_{\text{H.F.S}}^1, \phi_{\text{H.F.S}}^2, \cdots, \phi_{\text{H.F.S}}^J\}$) and a low-frequency subband ($\phi_{\text{L.F.S}}$). We process the reconstructed latent $z^*$ with the diffusion model to obtain a gradient for optimization, which is applied selectively to either the high-frequency or the low-frequency components, depending on the application. Consequently, our method produces edits that benefit from detail preservation (butterfly case, yellow for text and image borders) and color fidelity (stone, blue for text and image borders). Best viewed on a screen when zoomed in.
  • Figure 3: 3D texture editing pipeline. We represent a 3D texture field as a frequency-decomposed triplane $\phi$, i.e., three sets of wavelet subbands, one for each of the $XY$, $YZ$, and $XZ$ planes. To render an image at camera view $p$, we construct a triplane from $\phi$, which is queried for colors. The rendered image is processed by the latent diffusion model to produce a gradient, which is backpropagated to update selected frequency components.
  • Figure 4: Qualitative results. We conducted a qualitative comparison with the most competitive baselines. For low-frequency editing, our method follows instructions closely while preserving high-frequency details better. In the first row (stone lion), our method preserves details of the lion's eyes and mouth (A). In contrast, CDS, DDS, and other methods lose these structures, introducing significant changes. In the second row (chicken), we preserve the beak and eye areas (B). Other methods distort the structure noticeably, fail to generate meaningful images (e.g., DiffuseIT), or fail to follow the target description (DreamSampler, FlexiEdit). CDS, the best among baselines, alters the beak. For high-frequency editing, our method maintains better color fidelity than the baselines. In the first row, our approach preserves color consistency in the transformation from cat to fox. In the stone-to-Buddha case, our method preserves both the background and statue colors (C) better than CDS and similar methods. In the third row, our method better preserves image color information, especially the pupil and facial skin color, while still modifying details (D). Other methods introduce structure distortion, which can be attributed to the lack of global information guidance. Best viewed on a screen when zoomed in.
  • Figure 5: Qualitative results. Our frequency-aware denoising score method on 3D texture. We compare our results with the pure triplane representation. In the detail preservation cases (a) and (c), our method faithfully preserves the original texture from the source texture map. In contrast, SDS distorts the original texture and does not follow the text "blue" in the turtle case. In the color fidelity cases (b) and (d), our method retains the pink color of the quilt and the blue color of the sofa, while adding intricate patterns and texture details guided by the text. SDS, however, creates patterns on the quilt but changes the main color to a much lighter pink. In the sofa case, SDS also adds texture details but shifts the main color to a greenish-blue hue. Best viewed on a screen when zoomed in.
  • ...and 14 more figures
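
The subband notation in the Figure 2 caption, one low-frequency subband $\phi_{\text{L.F.S}}$ plus $J$ levels of high-frequency subbands $\phi_{\text{H.F.S}}^1, \cdots, \phi_{\text{H.F.S}}^J$, can be made concrete with a short sketch. It assumes a Haar wavelet, a single-channel latent whose side lengths are divisible by $2^J$, and an ordering in which level 1 is the finest; the caption fixes none of these, and the function names `haar2`, `ihaar2`, `decompose`, and `reconstruct` are hypothetical.

```python
import numpy as np

def haar2(x):
    """One level of the orthonormal 2D Haar transform on 2x2 blocks."""
    p, q = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each block
    r, s = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (p + q + r + s) / 2
    lh = (p - q + r - s) / 2
    hl = (p + q - r - s) / 2
    hh = (p - q - r + s) / 2
    return ll, (lh, hl, hh)

def ihaar2(ll, highs):
    """Exact inverse of haar2."""
    lh, hl, hh = highs
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def decompose(z, levels):
    """Split z into phi_lfs and a list phi_hfs of `levels` detail sets (finest first)."""
    phi_hfs = []
    low = z
    for _ in range(levels):
        low, highs = haar2(low)   # keep splitting the low-frequency part
        phi_hfs.append(highs)
    return low, phi_hfs

def reconstruct(phi_lfs, phi_hfs):
    """Rebuild the latent z* from the subbands, coarsest level first."""
    low = phi_lfs
    for highs in reversed(phi_hfs):
        low = ihaar2(low, highs)
    return low

# Toy usage on a 16x16 array with J = 3 levels.
z = np.arange(256, dtype=float).reshape(16, 16)
phi_lfs, phi_hfs = decompose(z, levels=3)
z_star = reconstruct(phi_lfs, phi_hfs)
```

In an editing loop of the kind the caption describes, only the chosen entries among `phi_lfs` and `phi_hfs` would be updated from the gradient on `z_star`, while the remaining subbands stay fixed.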