Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models

Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, Seungryong Kim

TL;DR

This paper addresses the challenge of guiding diffusion and flow models in unconditional generation by focusing perturbations at the level of attention heads within Diffusion Transformers. It introduces HeadHunter, an iterative head-selection framework that greedily constructs a set of attention heads whose perturbation aligns with user-defined objectives, and SoftPAG, a continuous interpolation mechanism that modulates perturbation strength by mixing the original attention with the identity matrix. The authors show that head-level perturbations reveal semantically interpretable concepts, enable compositional control over structure and style, and outperform traditional layer-level perturbation in both general quality and style-specific tasks on SD3 and FLUX. The work also demonstrates that HeadHunter generalizes to unseen prompts and provides a unified perspective on attention-based perturbations through SoftPAG and related perturbation strategies. Overall, the approach offers interpretable, modular, and controllable intervention tools for diffusion-based image synthesis, with practical implications for robust, style-aware generation.
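
The greedy construction described above can be pictured as a forward-selection loop. The following is a minimal sketch of generic greedy selection, not the paper's algorithm: the function name `greedy_head_selection`, the `(layer, head)` tuple encoding, and the `objective` callback are all assumptions made for illustration. In practice the objective would involve generating images with the candidate heads perturbed and scoring them (for example with a preference model); here a toy lookup table stands in for that step.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Head = Tuple[int, int]  # hypothetical encoding: (layer index, head index)


def greedy_head_selection(
    candidates: Iterable[Head],
    objective: Callable[[FrozenSet[Head]], float],
    num_rounds: int,
) -> List[Head]:
    """Greedily grow a head set, adding the best-scoring head each round."""
    selected: List[Head] = []
    remaining: List[Head] = list(candidates)
    for _ in range(num_rounds):
        if not remaining:
            break
        # Score every remaining head when added to the current selection.
        best = max(remaining, key=lambda h: objective(frozenset(selected + [h])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for a real objective: each head contributes a fixed gain.
# These numbers are made up and carry no empirical meaning.
toy_gain = {(12, 3): 0.5, (12, 7): 0.3, (13, 1): 0.1, (13, 5): -0.2}


def toy_objective(heads: FrozenSet[Head]) -> float:
    return sum(toy_gain[h] for h in heads)


print(greedy_head_selection(toy_gain, toy_objective, num_rounds=2))
# → [(12, 3), (12, 7)]
```

Because the objective is a plain callback, the same loop could target general quality or a specific style simply by swapping the scoring function, which is the kind of user-defined control the summary describes.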

Abstract

Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose "HeadHunter", a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
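
As a rough illustration of the two mechanisms named in the abstract, the sketch below interpolates a single head's attention map toward the identity matrix and then extrapolates away from the perturbed ("weak") prediction. It is written under several assumptions: the names `soft_pag` and `guide`, the interpolation weight `u`, and the guidance form `pred + w * (pred - pred_perturbed)` are illustrative choices rather than the paper's exact notation, and the attention map is taken to be a small square, row-stochastic matrix stored as nested lists.

```python
from typing import List

Matrix = List[List[float]]


def soft_pag(attn: Matrix, u: float) -> Matrix:
    """Blend a square attention map with the identity: (1 - u) * A + u * I.

    u = 0 leaves the map unchanged; u = 1 replaces it with the identity.
    """
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    n = len(attn)
    if any(len(row) != n for row in attn):
        raise ValueError("attention map must be square")
    return [
        [(1.0 - u) * attn[i][j] + (u if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def guide(pred: List[float], pred_perturbed: List[float], w: float) -> List[float]:
    """Push the prediction away from the perturbed one (assumed guidance form)."""
    return [p + w * (p - q) for p, q in zip(pred, pred_perturbed)]


attn = [[0.5, 0.5], [0.25, 0.75]]
print(soft_pag(attn, 0.5))
# → [[0.75, 0.25], [0.125, 0.875]]
print(guide([1.0, 2.0], [0.5, 2.5], w=2.0))
# → [2.0, 1.0]
```

Each row of the blended map still sums to 1, since both the original map and the identity are row-stochastic. Intermediate values of `u` therefore give a weaker perturbation than full identity replacement, which is the "continuous knob" the abstract refers to.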

Paper Structure

This paper contains 66 sections, 14 equations, 38 figures, 4 tables, 1 algorithm.

Figures (38)

  • Figure 1: Motivating example. Each image is generated with PAG \cite{ahn2024self}, where perturbation is applied to a single attention head within DiTs. Guiding with different perturbed attention heads produces notably distinct results. Results for additional heads are provided in Appendix \ref{sup:subsec:all-heads}. All images are generated with the prompt "smiling girl holding a cat, in a flower garden" using stable-diffusion-3-medium. Each row corresponds to a single layer, with different heads perturbed across columns.
  • Figure 1: Quantitative evaluation of generalizability to unseen content prompts. Numbers in parentheses denote the guidance scale $w$. Applying HeadHunter (style-oriented quality setting) to unseen content prompts demonstrates strong generalization, yielding significantly higher human preference scores than the baseline and performance comparable to CFG.
  • Figure 2: Generated images from head- and layer-level perturbation guidance. (a) Results of layer-level perturbation guidance, where perturbation is applied to all heads in the layer. (b) Results of head-level perturbation guidance, where each result is obtained by independently applying perturbation to a single head of the layer. Red boxes indicate high-performing heads in terms of PickScore \cite{kirstain2023pick}. (c) Perturbation guidance using only the high-performing heads identified in (b) yields higher-quality generations across both low-performing layers (L13) and well-performing ones (L12). The prompt “Turkish girl with lantern, dark room” is used.
  • Figure 3: Effect of head-level guidance on concept amplification and combination. (a) Guiding with individual heads amplifies specific visual concepts such as darkness, geometry, shadow, or color. (b) Guiding with two heads simultaneously combines their effects in the output.
  • Figure 4: Results of HeadHunter for general image quality improvement. Performance improves as more top-ranked heads are added, demonstrating the effectiveness of compositional head selection via HeadHunter. The dotted horizontal line in (b) indicates the best score achieved by layer-level guidance, which is surpassed by a compact set of top-$k$ heads for $k < 10$. Dashed lines indicate the FID of the best-performing layer-level perturbation for each guidance scale $w$ in Eq. \ref{eq:attention-perturbation-guidance}.
  • ...and 33 more figures