Attention Frequency Modulation: Training-Free Spectral Modulation of Diffusion Cross-Attention

Seunghun Oh, Unsang Park

Abstract

Cross-attention is the primary interface through which text conditions latent diffusion models, yet its step-wise multi-resolution dynamics remain under-characterized, limiting principled training-free control. We cast diffusion cross-attention as a spatiotemporal signal on the latent grid by summarizing token-softmax weights into token-agnostic concentration maps and tracking their radially binned Fourier power over denoising. Across prompts and seeds, encoder cross-attention exhibits a consistent coarse-to-fine spectral progression, yielding a stable time-frequency fingerprint of token competition. Building on this structure, we introduce Attention Frequency Modulation (AFM), a plug-and-play inference-time intervention that edits token-wise pre-softmax cross-attention logits in the Fourier domain: low- and high-frequency bands are reweighted with a progress-aligned schedule and can be adaptively gated by token-allocation entropy, before the token softmax. AFM provides a continuous handle to bias the spatial scale of token-competition patterns without retraining, prompt editing, or parameter updates. Experiments on Stable Diffusion show that AFM reliably redistributes attention spectra and produces substantial visual edits while largely preserving semantic alignment. Finally, we find that entropy mainly acts as an adaptive gain on the same frequency-based edit rather than an independent control axis.
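The frequency-domain edit described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the hard radial cutoff `r_c`, the constant gains `low_gain` and `high_gain`, and the choice to leave the DC term untouched are all assumptions made for illustration. The progress-aligned schedule and the entropy gate are omitted because the abstract does not specify their form.

```python
import numpy as np

def radial_frequency(h, w):
    """Normalized radial frequency of each FFT bin (0 at DC, 1 at the axis Nyquist)."""
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx**2 + fy**2) / 0.5

def modulate_logits(logits, low_gain, high_gain, r_c=0.25):
    """Reweight low/high spatial-frequency bands of per-token logit maps.

    logits: array of shape (h, w, n_tokens), pre-softmax cross-attention logits
            laid out on the latent grid (hypothetical layout).
    """
    h, w, _ = logits.shape
    r = radial_frequency(h, w)
    # Hard band split at radius r_c (assumption; a smooth mask would also work).
    gain = np.where(r <= r_c, float(low_gain), float(high_gain))
    gain[0, 0] = 1.0  # assumption: keep each token's mean logit unchanged
    spec = np.fft.fft2(logits, axes=(0, 1))
    # The gain depends only on radius, so the edited spectrum stays Hermitian
    # and the inverse transform is real up to rounding error.
    return np.fft.ifft2(spec * gain[:, :, None], axes=(0, 1)).real

def token_softmax(logits):
    """Numerically stable softmax over the token axis (last axis)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

In this sketch the edited logits would be passed to `token_softmax` in place of the originals, for example `token_softmax(modulate_logits(logits, 1.2, 0.8))`, so the modulation changes how tokens compete at each spatial location rather than rescaling the final attention weights directly.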

Paper Structure

This paper contains 60 sections, 17 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Qualitative comparison on Stable Diffusion v1.5 under matched sampling settings (same prompt/seed). (a) Baseline, (b) SAG, (c) FreeU, (d) Ours (AFM).
  • Figure 2: Time-frequency evolution of encoder cross-attention (top-$K$, mean). Left/middle: normalized radial energy distributions for Baseline and AFM-curve. Right: log energy ratio $\log(E_{\text{curve}}/E_{\text{baseline}})$, highlighting frequency bands amplified/suppressed by AFM over denoising progress. The x-axis is denoising progress $u(s)$ (early $\rightarrow$ late); tick labels show the corresponding DDIM scheduler timesteps $\tau_s$ (decreasing). Dashed line indicates the HF cutoff radius $r_c$ used in $\rho_s$.
  • Figure 3: Quantitative summary of coarse-to-fine and AFM effects (encoder, top-$K$). (a) uses denoising progress $u(s)$ (early $\rightarrow$ late); tick labels show decreasing DDIM timesteps $\tau_s$. (b) shows radial energy profiles over normalized radius $r$.
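
The quantities plotted in Figures 2 and 3 (normalized radial energy and the log energy ratio) might be computed roughly as follows. This is a hedged sketch, not the paper's code: the function names, the number of radial bins, the equal-width binning, and the `eps` stabilizer are assumptions. The input is any 2D map on the latent grid, since the construction of the concentration maps is not given in this section.

```python
import numpy as np

def radial_energy(attn_map, n_bins=8):
    """Normalized radially binned Fourier power of a 2D map (bins sum to 1)."""
    h, w = attn_map.shape
    power = np.abs(np.fft.fft2(attn_map)) ** 2
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx**2 + fy**2) / 0.5  # 0 at DC, 1 at the axis Nyquist
    # Equal-width radial bins from 0 to the largest radius (assumption).
    edges = np.linspace(0.0, r.max(), n_bins + 1)
    idx = np.clip(np.digitize(r.ravel(), edges) - 1, 0, n_bins - 1)
    energy = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    return energy / energy.sum()

def log_energy_ratio(e_curve, e_baseline, eps=1e-12):
    """Per-bin log(E_curve / E_baseline); eps guards against empty bins."""
    return np.log((e_curve + eps) / (e_baseline + eps))
```

Stacking `radial_energy` over denoising steps gives a time-frequency array like the left and middle panels of Figure 2. Positive entries of `log_energy_ratio` mark bands amplified relative to the baseline, and negative entries mark suppressed bands.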