Smoothed Energy Guidance: Guiding Diffusion Models with Reduced Energy Curvature of Attention

Susung Hong

TL;DR

This paper proposes Smoothed Energy Guidance (SEG), a novel training- and condition-free approach that leverages the energy-based perspective of the self-attention mechanism to enhance image generation. SEG achieves a Pareto improvement in both quality and the reduction of side effects.

Abstract

Conditional diffusion models have shown remarkable success in visual content generation, producing high-quality samples across various domains, largely due to classifier-free guidance (CFG). Recent attempts to extend guidance to unconditional models have relied on heuristic techniques, resulting in suboptimal generation quality and unintended effects. In this work, we propose Smoothed Energy Guidance (SEG), a novel training- and condition-free approach that leverages the energy-based perspective of the self-attention mechanism to enhance image generation. By defining the energy of self-attention, we introduce a method to reduce the curvature of the energy landscape of attention and use the output as the unconditional prediction. Practically, we control the curvature of the energy landscape by adjusting the Gaussian kernel parameter while keeping the guidance scale parameter fixed. Additionally, we present a query blurring method that is equivalent to blurring the entire attention weights without incurring quadratic complexity in the number of tokens. In our experiments, SEG achieves a Pareto improvement in both quality and the reduction of side effects. The code is available at https://github.com/SusungHong/SEG-SDXL.
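
The query blurring claim in the abstract can be illustrated with a small numerical sketch. It is not taken from the released code: it assumes that "attention weights" refers to the pre-softmax logits $\mathbf{Q}\mathbf{K}^\top$, that the blur acts on the query positions arranged as an $H \times W$ grid, and that boundaries are periodic. The helper names `gaussian_kernel_1d` and `blur_spatial` are hypothetical. Under these assumptions the blur is a linear operator on the token axis, so blurring the queries first gives the same logits as blurring the full $N \times N$ matrix.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()

def blur_spatial(x, h, w, sigma, radius=2):
    """Blur every column of x, shape (h * w, c), as an h x w map.

    Assumption: periodic boundaries, implemented with np.roll.
    """
    kernel = gaussian_kernel_1d(sigma, radius)
    shifts = range(-radius, radius + 1)
    maps = x.reshape(h, w, -1)
    # Separable Gaussian: blur along rows, then along columns.
    maps = sum(wt * np.roll(maps, s, axis=0) for wt, s in zip(kernel, shifts))
    maps = sum(wt * np.roll(maps, s, axis=1) for wt, s in zip(kernel, shifts))
    return maps.reshape(h * w, -1)

rng = np.random.default_rng(0)
h, w, d = 8, 8, 16                       # toy sizes, N = h * w tokens
Q = rng.standard_normal((h * w, d))
K = rng.standard_normal((h * w, d))

# Direct route: form the N x N logits, then blur each column over query positions.
logits_blurred_directly = blur_spatial(Q @ K.T, h, w, sigma=1.5)

# Query-blurring route: blur the N x d queries, then form the logits.
logits_from_blurred_q = blur_spatial(Q, h, w, sigma=1.5) @ K.T

max_abs_diff = np.abs(logits_blurred_directly - logits_from_blurred_q).max()
print(f"max |difference| = {max_abs_diff:.2e}")
```

The direct route blurs $N$ columns of length $N$, while the query route blurs only $d$ columns of length $N$, which is where the saving in the number of tokens comes from in this sketch.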

Paper Structure

This paper contains 32 sections, 4 theorems, 27 equations, 20 figures, 2 tables.

Key Result

Lemma 3.1

Spatially applying a 2D Gaussian blur to the attention weights $\mathbf{a} := \mathbf{Q}\mathbf{k}^\top$ preserves the average $\mathbb{E}_{i, j}[a_{(i,j)}]$. In addition, the variance monotonically decreases every time we apply the Gaussian blur.
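
A quick numerical check of the lemma, as a sketch rather than a proof. It assumes a normalized Gaussian kernel and periodic boundary handling (other boundary modes may preserve the mean only approximately), and it uses a random $16 \times 16$ map as a stand-in for $\mathbf{a}$. The helper names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    offsets = np.arange(-radius, radius + 1)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    return weights / weights.sum()

def gaussian_blur_2d(a, sigma=1.0, radius=3):
    """Separable 2D Gaussian blur with periodic boundaries (an assumption)."""
    kernel = gaussian_kernel_1d(sigma, radius)
    shifts = range(-radius, radius + 1)
    a = sum(wt * np.roll(a, s, axis=0) for wt, s in zip(kernel, shifts))
    a = sum(wt * np.roll(a, s, axis=1) for wt, s in zip(kernel, shifts))
    return a

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 16))        # stand-in for one attention map

means, variances = [a.mean()], [a.var()]
for _ in range(3):                       # apply the blur repeatedly
    a = gaussian_blur_2d(a)
    means.append(a.mean())
    variances.append(a.var())

for step, (m, v) in enumerate(zip(means, variances)):
    print(f"blur x{step}: mean = {m:+.6f}, variance = {v:.6f}")
```

In this setting the printed mean stays constant up to floating-point error, while the variance shrinks with each application of the blur.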

Figures (20)

  • Figure 1: Teaser. (a) Images sampled from vanilla SDXL (Podell et al., 2023) without any guidance. (b) Images sampled with Smoothed Energy Guidance (Ours). $\varnothing$ denotes that there is no condition given. With various input conditions, and even without any, SEG supports the diffusion model in generating plausible and high-quality images without any training.
  • Figure 2: Unconditional generation using SEG.
  • Figure 3: Text-conditional generation using SEG.
  • Figure 4: Conditional generation using ControlNet (Zhang et al., 2023) and SEG.
  • Figure 5: Qualitative comparison of SEG with vanilla SDXL (Podell et al., 2023), SAG (Hong et al., 2023), and PAG (Ahn et al., 2024).
  • ...and 15 more figures

Theorems & Definitions (6)

  • Definition 2.1: Energy Function for Self-Attention
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.1
  • Proposition 3.1
  • proof