Dynamic Frequency Modulation for Controllable Text-driven Image Generation

Tiandong Shi, Ling Zhao, Ji Qi, Jiayi Ma, Chengli Peng

TL;DR

This work finds that, in text-guided diffusion, low-frequency components of the noisy latent establish the global structure early in generation, while higher-frequency components shape fine textures later. It introduces a training-free Frequency Modulation Method (FMM) that blends the frequency components of the latents generated from the original and refined prompts using time-varying weights, preserving structure while allowing semantic updates. The approach rests on a frequency-domain analysis of diffusion that uses power spectral density (PSD) and signal-to-noise ratio (SNR) relationships to justify coarse-to-fine generation. A Gaussian weighting function with dynamic decay, governed by the parameters $\alpha$ and $\sigma$, controls how strongly each part of the spectrum is constrained. On PIE-Bench and ImageNetR-Fake, FMM outperforms state-of-the-art spatial-domain methods in balancing structure preservation with semantic alignment. The paper also discusses real-image editing and a potential limitation caused by inversion perturbations.

Abstract

The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, and their performance depends heavily on choosing the right feature maps, which leads to suboptimal stability. This paper addresses the problem from a frequency perspective, analyzing how the frequency spectrum of noisy latent variables affects the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method keeps the structure framework consistent while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
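
The blending idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`radial_distance`, `weight`, `modulate`), the exact form of the weighting function, the roles given to `alpha` (decay rate) and `sigma` (Gaussian width), and their default values are all hypothetical. The summary only states that the weighting is Gaussian in frequency, decays dynamically over the generation process, and is controlled by $\alpha$ and $\sigma$.

```python
import numpy as np

def radial_distance(h, w):
    """Distance of every FFT bin from the zero-frequency (DC) bin, in cycles/sample."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx**2 + fy**2)

def weight(d, t, T, alpha=5.0, sigma=0.1):
    """Hypothetical omega(d, t): Gaussian in d, scaled by a decay over sampling progress.

    Assumption: t runs from T (pure noise) down to 0 (final sample), so the
    constraint is strongest at the start of generation and relaxes afterwards.
    """
    progress = 1.0 - t / T                      # 0 at the first step, 1 at the last
    decay = np.exp(-alpha * progress)           # assumed decay schedule
    return decay * np.exp(-d**2 / (2.0 * sigma**2))

def modulate(z_ori, z_ref, t, T, alpha=5.0, sigma=0.1):
    """Blend two latents of shape (..., H, W) in the frequency domain."""
    h, w = z_ori.shape[-2:]
    om = weight(radial_distance(h, w), t, T, alpha, sigma)
    f_ori = np.fft.fft2(z_ori, axes=(-2, -1))
    f_ref = np.fft.fft2(z_ref, axes=(-2, -1))
    # Low frequencies lean on the original latent, the rest on the refined one.
    f_mix = om * f_ori + (1.0 - om) * f_ref
    # The weight is symmetric in frequency, so the result is real up to round-off.
    return np.fft.ifft2(f_mix, axes=(-2, -1)).real
```

In a sampling loop, `modulate` would be applied to the refined-prompt latent at each step, with the original-prompt latent at the same step as the reference. With this assumed schedule, the zero-frequency component is copied entirely from the original latent at the first step and almost entirely from the refined latent at the last.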

Paper Structure (26 sections, 11 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: The key challenges faced by the image generation paradigm driven by the iterative refinement of text prompts. When the user changes the original text prompt to make a specific semantic adjustment, the newly generated images often exhibit unexpected reconstructions of composition, posture, background, etc., which goes against the user's creative intent.
  • Figure 2: Visual comparison of generation results obtained by applying high-pass filtering to the noisy latent variable $\bm{z}_t$ at different stages. The first column shows the images generated without filtering, whereas the subsequent columns show the images generated when the intervention is applied during the early, middle, and late stages, respectively.
  • Figure 3: Overview of the frequency modulation method. The method dynamically fuses the frequency components of $\bm{z}_t^{original}$ and $\bm{z}_t^{refined}$ using a frequency-dependent weighting function $\omega(d,t)$ with a dynamic decay strategy. It imposes stronger constraints in the early generation stage to preserve structure while gradually relaxing these constraints in the later generation stage to enhance fine-grained texture synthesis.
  • Figure 4: Overall generation pipeline. At each generation step, the FMM dynamically modulates the frequency components of $z_{t}^{ref}$ using the proposed frequency-dependent weighting function $\omega(d,t)$ with dynamic decay. Consequently, the final image $I_{ref}$ inherits the structure framework of image $I_{ori}$ while faithfully reflecting the semantic content specified by the text prompt $p_{ref}$.
  • Figure 5: Qualitative comparison on various iterative prompt refinement scenarios. The proposed method faithfully generates the semantic content specified in the refined prompts while strictly preserving the structure of the original images. In contrast, the other methods either fail to preserve the structure or struggle to render the semantic content accurately.
  • ...and 7 more figures
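
The intervention described in the Figure 2 caption, high-pass filtering the noisy latent at a chosen stage, can be illustrated with a short NumPy sketch. The caption does not say which filter or cutoff is used, so the ideal (hard-cutoff) filter, the function name `high_pass`, and the `cutoff` value below are assumptions made only for illustration.

```python
import numpy as np

def high_pass(z, cutoff=0.1):
    """Zero out all frequency bins of z (shape (..., H, W)) at or below `cutoff`.

    `cutoff` is in cycles/sample; 0.5 is the Nyquist frequency along one axis.
    An ideal hard-cutoff mask is assumed here; the source does not specify the filter.
    """
    h, w = z.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    keep = np.sqrt(fx**2 + fy**2) > cutoff      # True for high-frequency bins only
    spectrum = np.fft.fft2(z, axes=(-2, -1))
    # The mask is symmetric in frequency, so the inverse transform is real up to round-off.
    return np.fft.ifft2(spectrum * keep, axes=(-2, -1)).real
```

Applying `high_pass` to the latent only during early steps removes the low-frequency content at the stage where, according to the paper's analysis, the structure framework is being established. Applying it only during late steps leaves that stage untouched.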