Token Perturbation Guidance for Diffusion Models

Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, Babak Taati

TL;DR

The paper addresses the limited applicability of classifier-free guidance (CFG) by proposing Token Perturbation Guidance (TPG), a training-free method that perturbs intermediate token representations to guide diffusion sampling. TPG relies on a norm-preserving, orthonormal perturbation (notably token shuffling) to generate a negative score that, when combined with the conditional signal, yields CFG-like guidance without architectural changes. On SDXL and Stable Diffusion 2.1, TPG improves unconditional Fréchet Inception Distance (FID) by nearly 2× over the SDXL baseline and is competitive with CFG in prompt alignment in conditional settings. The approach offers a simple, plug-and-play alternative that extends CFG-like benefits to a broader class of diffusion models, covering both conditional and unconditional generation.

Abstract

Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling resembles CFG more closely than that of existing training-free guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2$\times$ improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models.
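
The two ideas in the abstract, a norm-preserving token shuffle and a guidance term built from the perturbed prediction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the tokens are random numbers rather than activations of a diffusion network, and the combination rule `eps + gamma * (eps - eps_perturbed)` is an assumption modeled on the CFG update.

```python
import math
import random


def shuffle_tokens(tokens, rng):
    """Reorder the token rows with a random permutation (hypothetical helper)."""
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    return [tokens[i] for i in perm]


def frobenius_norm(tokens):
    """Square root of the sum of squared entries over all tokens."""
    return math.sqrt(sum(v * v for row in tokens for v in row))


def guided_prediction(eps, eps_perturbed, gamma):
    """Assumed CFG-style rule: push away from the perturbed prediction."""
    return [e + gamma * (e - p) for e, p in zip(eps, eps_perturbed)]


rng = random.Random(0)
# Toy "token matrix": 6 tokens with 4 channels each.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
shuffled = shuffle_tokens(tokens, rng)

# Shuffling only reorders rows, so the overall norm is unchanged.
norm_before = frobenius_norm(tokens)
norm_after = frobenius_norm(shuffled)

# Toy predictions standing in for the regular and perturbed network outputs.
guided = guided_prediction([1.0, 2.0], [0.5, 2.5], gamma=3.0)
```

Because a shuffle is a permutation of rows, it keeps every token's values intact and changes only their positions, which is why the norm check holds regardless of the random seed.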

Paper Structure

This paper contains 22 sections, 5 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Visualization of the denoising process over time for different guidance strategies: CFG [ho2022classifier], PAG [ahn2024self], SEG [hong2024smoothed], and our proposed TPG. Each row shows generated images at various denoising time steps, from $t=981$ (left) to $t=1$ (right). The red box highlights the early-to-middle denoising stage ($t=821$ to $t=741$), where CFG and TPG demonstrate clearer structure (e.g., the horse's face) and consistency. The text prompt used is "a female in a black jacket is riding a brown and white horse".
  • Figure 2: Analyzing the behavior of different guidance methods across denoising steps. (a) Cosine similarity between the added guidance term $\Delta e$ in $\tilde{e}_\theta = e_\theta + \gamma \Delta e$ and the true noise $\epsilon$. SEG and PAG exhibit negative alignment at intermediate steps, while TPG and CFG maintain near-zero cosine values, indicating orthogonality to the noise. (b) Cosine similarity between the full guided score $\tilde{e}_\theta$ and $\epsilon$. Compared to SEG and PAG, TPG behaves more similarly to CFG across sampling. (c) $\ell_2$ norm of the guidance term $\Delta e$. TPG and CFG follow nearly identical trends, both starting around 40 and increasing steeply in the later denoising steps. In contrast, SEG and PAG maintain consistently low norms throughout.
  • Figure 3: Frequency analysis of guidance residuals throughout sampling. Each heatmap shows either the cosine similarity between the guidance term $\Delta e$ and the ground-truth noise $\epsilon$ (top row), or the $\ell_2$ norm of the guidance term (bottom row), as a function of frequency bin (horizontal axis) and denoising step (vertical axis; 1000 $\rightarrow$ 1). Top: For both CFG and TPG, the guidance term remains almost orthogonal to the noise across all frequencies, with a mild positive bump in the lowest bands. In contrast, SEG transitions from weak positive alignment in the early steps to a pronounced negative stripe centered at medium frequencies. Bottom: CFG and TPG concentrate most of their energy in the lowest frequency bin and inject significantly larger magnitudes than SEG, whose energy remains up to two orders of magnitude smaller throughout the denoising process.
  • Figure 4: Qualitative comparison of unconditional generations produced by Vanilla SDXL [podell2023sdxl], PAG [ahn2024self], SEG [hong2024smoothed], and our method (TPG). TPG achieves more realistic generations compared to other training-free guidance methods.
  • Figure 5: Qualitative comparison of conditional generations produced by Vanilla SDXL [podell2023sdxl], CFG [ho2022classifier], PAG [ahn2024self], SEG [hong2024smoothed], and our method (TPG). TPG is able to achieve good quality and prompt alignment compared to other baselines such as PAG and SEG.
  • ...and 8 more figures
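
The three quantities plotted in Figure 2 can be computed as in the sketch below, given a base prediction $e_\theta$, a guidance term $\Delta e$, the true noise $\epsilon$, and a scale $\gamma$, with $\tilde{e}_\theta = e_\theta + \gamma \Delta e$ as in the caption. The function names and the toy vectors are hypothetical and chosen only for illustration; real inputs would be flattened network outputs at a given denoising step.

```python
import math


def l2_norm(a):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(v * v for v in a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is zero."""
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


def guidance_diagnostics(eps, delta, noise, gamma):
    """Return the three per-step quantities described in the Figure 2 caption."""
    guided = [e + gamma * d for e, d in zip(eps, delta)]
    return {
        "cos_delta_noise": cosine_similarity(delta, noise),    # panel (a)
        "cos_guided_noise": cosine_similarity(guided, noise),  # panel (b)
        "delta_norm": l2_norm(delta),                          # panel (c)
    }


# Toy example: the guidance term is orthogonal to the noise.
stats = guidance_diagnostics(
    eps=[1.0, 0.0], delta=[0.0, 2.0], noise=[1.0, 0.0], gamma=1.0
)
```

In this toy case the guidance term is exactly orthogonal to the noise, so the first value is zero, which is the regime the caption reports for CFG and TPG.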