Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance

Kwanyoung Kim

TL;DR

Guidance in diffusion models has largely relied on heuristic perturbations with limited theoretical grounding. This work introduces Adversarial Sinkhorn Attention Guidance (ASAG), an OT-based framework that adversarially perturbs self-attention via an entropy-maximizing Sinkhorn plan to disrupt misleading attention without retraining. The authors provide both theoretical and practical justification, including entropy-maximizing transport theory and a finite-Sinkhorn approximation, and demonstrate state-of-the-art improvements in unconditional and conditional image generation, as well as enhanced performance when combined with ControlNet and IP-Adapter. ASAG is lightweight, plug-and-play, and broadly applicable across diffusion backbones, offering a principled path toward more reliable diffusion sampling.

Abstract

Diffusion models have demonstrated strong generative performance when using guidance methods such as classifier-free guidance (CFG), which enhance output quality by modifying the sampling trajectory. These methods typically improve a target output by intentionally degrading another, often the unconditional output, using heuristic perturbation functions such as identity mixing or blurred conditions. However, these approaches lack a principled foundation and rely on manually designed distortions. In this work, we propose Adversarial Sinkhorn Attention Guidance (ASAG), a novel method that reinterprets attention scores in diffusion models through the lens of optimal transport and intentionally disrupts the transport cost via the Sinkhorn algorithm. Instead of naively corrupting the attention mechanism, ASAG injects an adversarial cost within self-attention layers to reduce pixel-wise similarity between queries and keys. This deliberate degradation weakens misleading attention alignments and leads to improved conditional and unconditional sample quality. ASAG shows consistent improvements in text-to-image diffusion, and enhances controllability and fidelity in downstream applications such as IP-Adapter and ControlNet. The method is lightweight, plug-and-play, and improves reliability without requiring any model retraining.
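The "improve a target output by degrading another" pattern described in the abstract can be sketched as a simple extrapolation between two noise predictions. This is a minimal illustration, not the paper's exact update rule: the function name `guided_prediction`, the argument names, and the scale convention are assumptions made here for clarity.

```python
import numpy as np

def guided_prediction(eps_target, eps_degraded, scale):
    """Push the target prediction away from a deliberately degraded one.

    eps_target   : noise prediction from the unmodified forward pass
    eps_degraded : noise prediction from a degraded pass (for example a null
                   condition, or a pass with perturbed self-attention)
    scale        : guidance strength; 0 returns the target unchanged
    All names here are hypothetical and chosen for illustration only.
    """
    return eps_target + scale * (eps_target - eps_degraded)

# CFG in its common form, eps_uncond + w * (eps_cond - eps_uncond), is the
# same extrapolation with target = conditional, degraded = unconditional,
# and scale = w - 1.
eps_cond = np.array([1.0, 2.0])
eps_uncond = np.array([0.5, -1.0])
w = 7.5
cfg = eps_uncond + w * (eps_cond - eps_uncond)
same = guided_prediction(eps_cond, eps_uncond, w - 1.0)
print(np.allclose(cfg, same))
```

Under this view, guidance methods differ mainly in how the degraded prediction is produced; per the abstract, ASAG produces it by injecting an adversarial cost into self-attention rather than by a hand-designed distortion.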

Paper Structure

This paper contains 30 sections, 5 theorems, 25 equations, 12 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathbf{Q}_t, \mathbf{K}_t \in \mathbb{R}^{n \times d}$ be the query and key matrices at diffusion timestep $t$. Define the adversarial cost matrix as $\mathbf{M}_t^{\downarrow} = \mathbf{Q}_t\mathbf{K}_t^{\top}$. The entropy-regularized OT problem is defined as $d^{\lambda}_{\mathbf{M}_t^{\downarrow}}$ ...
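A minimal NumPy sketch of the idea behind Theorem 1, written under assumptions the excerpt does not confirm: uniform marginals, the Gibbs kernel $\exp(-\lambda \mathbf{M})$ from the standard entropy-regularized OT formulation, a $1/\sqrt{d}$ scaling of the scores, and the hypothetical function names `sinkhorn_plan` and `plan_entropy`. It treats query-key similarity as a transport cost, so the resulting plan moves mass away from highly similar pairs.

```python
import numpy as np

def sinkhorn_plan(cost, lam=0.5, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals.

    Runs a fixed number of Sinkhorn scaling iterations on the kernel
    exp(-lam * cost). Assumption: this is the standard formulation, not
    necessarily the exact variant used in the paper.
    """
    n, m = cost.shape
    r = np.full(n, 1.0 / n)           # uniform row marginal
    c = np.full(m, 1.0 / m)           # uniform column marginal
    # Shifting the cost by a constant only rescales the kernel, which the
    # scaling vectors absorb; it keeps the exponentials well-behaved.
    kernel = np.exp(-lam * (cost - cost.min()))
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (kernel @ v)          # match row sums
        v = c / (kernel.T @ u)        # match column sums
    return u[:, None] * kernel * v[None, :]

def plan_entropy(plan):
    """Shannon entropy of a plan, ignoring zero entries."""
    p = plan[plan > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
n, d = 16, 8
queries = rng.standard_normal((n, d))
keys = rng.standard_normal((n, d))

# Adversarial cost: the similarity itself is the cost to be minimized.
cost = queries @ keys.T / np.sqrt(d)
plan = sinkhorn_plan(cost)

# Rescale so each row sums to (approximately) 1, like an attention map.
adversarial_attention = n * plan
print(adversarial_attention.sum(axis=1))
print(plan_entropy(plan), np.log(n * n))  # entropy vs. its upper bound
```

With $\lambda$ close to 0 the kernel is nearly constant and the plan approaches the uniform matrix, the maximum-entropy case named in Lemma 1; larger $\lambda$ concentrates mass on low-similarity pairs.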

Figures (12)

  • Figure 1: Qualitative comparison. (a) unconditional generation, (b) conditional generation with other guidance sampling methods, and (c) conditional generation using ControlNet and IP-Adapter. Our method, ASAG, significantly improves visual quality in both unconditional and conditional settings. It also remarkably enhances external frameworks like ControlNet and IP-Adapter. Crucially, ASAG requires no additional training, making it broadly compatible and readily deployable.
  • Figure 2: Conceptual comparison between ASAG and other guidance methods. Existing guidance methods often rely on null conditions or heuristic perturbations of self-attention, such as injecting identity matrices or applying Gaussian blurs, to simulate undesirable paths. In contrast, ASAG explicitly defines an attention cost function based on pixel-level interactions and disrupts attention scores by minimizing this cost, thereby intentionally breaking semantic interactions through the Sinkhorn algorithm.
  • Figure 3: Comparison results on (A) unconditional and (B) conditional generation using Vanilla, CFG, PAG, SEG, and ASAG. While other guidance methods often alter the structure of the original outputs, ASAG achieves both higher visual quality and stronger consistency in structure and intent.
  • Figure 4: ControlNet examples with different guidance sampling methods. Left: Canny condition; Right: Depth condition. Our method, when integrated with ControlNet, substantially improves visual quality and preserves fine-grained image details.
  • Figure 5: Comparison of guidance sampling methods combined with ControlNet and IP-Adapter under pose and depth conditions. Our method significantly enhances image quality, yielding clearer structures.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Theorem 1: Entropy-Maximizing Plan via Adversarial Sinkhorn
  • Lemma 1: Uniform Plan Maximizes Entropy
  • Remark 1
  • Corollary 1.1
  • Theorem 1: Entropy-Maximizing Plan via Adversarial Sinkhorn
  • proof
  • Lemma 1: Uniform Plan Maximizes Entropy
  • proof