FreSca: Scaling in Frequency Space Enhances Diffusion Models

Chao Huang, Susan Liang, Yunlong Tang, Jing Bi, Li Ma, Yapeng Tian, Chenliang Xu

TL;DR

FreSca addresses fine-grained control in latent diffusion models by leveraging frequency-domain manipulation of the classifier-free guidance noise difference $\Delta\epsilon_t$. It analyzes frequency representations across pixel and latent spaces to identify $\Delta\epsilon_t$ as a semantically rich target, and then decomposes it into low- and high-frequency components with flexible cutoffs and independent scales $l$ and $h$. The framework is model- and task-agnostic, enabling plug-in use across SDXL, SD3, depth estimation, editing, and video synthesis without retraining. Empirically, FreSca improves generation quality across multiple tasks, demonstrating broad applicability and practical impact.

Abstract

Latent diffusion models (LDMs) have achieved remarkable success in a variety of image tasks, yet achieving fine-grained, disentangled control over global structures versus fine details remains challenging. This paper explores frequency-based control within latent diffusion models. We first systematically analyze frequency characteristics across pixel space, VAE latent space, and internal LDM representations. This reveals that the "noise difference" term, derived from classifier-free guidance at each step t, is a uniquely effective and semantically rich target for manipulation. Building on this insight, we introduce FreSca, a novel and plug-and-play framework that decomposes noise difference into low- and high-frequency components and applies independent scaling factors to them via spatial or energy-based cutoffs. Essentially, FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control. We demonstrate its versatility and effectiveness in improving generation quality and structural emphasis on multiple architectures (e.g., SD3, SDXL) and across applications including image generation, editing, depth estimation, and video synthesis, thereby unlocking a new dimension of expressive control within LDMs.
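As a rough illustration of the idea described in the abstract, the sketch below splits a noise-difference tensor into low- and high-frequency bands with a radial cutoff in the Fourier domain, scales each band independently, and plugs the result back into the standard classifier-free guidance update. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `fresca_scale` and `guided_noise`, the normalized cutoff `r0`, the hard circular mask, and the `(C, H, W)` layout are all hypothetical choices made here for illustration. Only the spatial (radius-based) cutoff is shown; the energy-based variant is not.

```python
import numpy as np


def fresca_scale(delta_eps, low_scale=1.0, high_scale=1.0, r0=0.5):
    """Scale low/high-frequency bands of a (C, H, W) noise difference.

    Assumption: r0 is a normalized cutoff radius; frequency bins whose
    distance from the DC bin is at most r0 * min(H, W) / 2 count as "low".
    """
    _, h, w = delta_eps.shape
    # Per-channel 2D spectrum, shifted so the DC bin sits at the center.
    spec = np.fft.fftshift(np.fft.fft2(delta_eps, axes=(-2, -1)), axes=(-2, -1))
    # Distance of every frequency bin from the centered DC bin.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= r0 * min(h, w) / 2
    # Independent gains: low_scale inside the cutoff, high_scale outside.
    gain = np.where(low_mask, low_scale, high_scale)
    scaled = np.fft.ifftshift(spec * gain, axes=(-2, -1))
    # The gain is symmetric about DC, so the result is real up to rounding.
    return np.fft.ifft2(scaled, axes=(-2, -1)).real


def guided_noise(eps_uncond, eps_cond, guidance_scale, **scale_kwargs):
    """Classifier-free guidance with a frequency-scaled noise difference."""
    delta_eps = eps_cond - eps_uncond
    return eps_uncond + guidance_scale * fresca_scale(delta_eps, **scale_kwargs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_u = rng.standard_normal((4, 32, 32))
    eps_c = rng.standard_normal((4, 32, 32))
    # Boost high frequencies only; the scale values are arbitrary examples.
    out = guided_noise(eps_u, eps_c, guidance_scale=7.5, high_scale=1.25)
    print(out.shape)
```

With `low_scale = high_scale = 1` the function returns its input unchanged, so the wrapper reduces to ordinary classifier-free guidance. Because the scaling acts only on the predicted noise at sampling time, a sketch like this needs no change to model weights or architecture, which matches the plug-and-play claim above.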

Paper Structure

This paper contains 23 sections, 5 equations, 25 figures, 7 tables.

Figures (25)

  • Figure 1: FreSca: A plug-and-play enhancement for diffusion models. Without retraining, FreSca refines Marigold depth predictions to recover fine details (top); enables precise, prompt-aligned generation over SD3 (middle); and boosts motion, detail, and temporal consistency in VideoCrafter2 video generation (bottom).
  • Figure 2: (a) Frequency decomposition of an RGB image $(I_l,I_h)$ and its SD3/SDXL VAE encodings $(x_l,x_h)$ with $r_0=0.05$ (pixel) and $r_0=0.5$ (latent). (b) Cutoff‐radius sensitivity in pixel vs. latent space.
  • Figure 3: (a) SDXL outputs (left) and results of frequency decomposition on various diffusion representations (right); top: high‐frequency components, bottom: low‐frequency components; cutoff $r_0=0.5$. (b) Temporal average over $T$ steps for each representation, highlighting the semantic richness of the noise‐difference term.
  • Figure 4: Relative log amplitudes of the Fourier spectrum over all $T$ denoising steps for (a) the latent variables $\mathbf{x}_t$, (b) the noise prediction $\epsilon_t$, and (c) the noise‐difference term $\Delta\epsilon_t$. Each curve corresponds to a timestep, illustrating how the low- and high-frequency content changes in each representation.
  • Figure 5: Examples of original SDXL generations (top) and the results of applying high‐pass (middle) and low‐pass (bottom) filters to $\Delta\epsilon_{1:T}$.
  • ...and 20 more figures