EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance

Ankit Yadav, Ta Duc Huy, Lingqiao Liu

TL;DR

This work addresses limitations of classifier-free guidance by introducing Exponential Moving Average Guidance (EMAG), a training-free method that perturbs attention using an attention-space EMA and adaptive layer selection to generate semantically faithful hard negatives. By controlling high-frequency degradation through the EMA and targeting middle layers, EMAG surfaces subtle failure modes that the denoiser can correct, raising human-preference scores while preserving global structure. EMAG consistently improves HPS across conditional and unconditional tasks, and it composes with advanced guidance strategies such as APG and CADS, yielding additive gains without sacrificing other quality metrics. The approach is validated on transformer-based diffusion backbones (DiT-XL/2 and SD3-Medium) and serves as a practical plug-in enhancement for high-fidelity image synthesis that aligns well with human judgments.

Abstract

In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. Classifier-free guidance (CFG) is the de facto choice in modern systems and achieves this by contrasting conditional and unconditional samples. Recent work explores contrasting negative samples at inference using a weaker model, obtained via strong/weak model pairs, attention-based masking, stochastic block dropping, or perturbations to the self-attention energy landscape. While these strategies refine generation quality, they still lack reliable control over the granularity or difficulty of the negative samples, and target-layer selection is often fixed. We propose Exponential Moving Average Guidance (EMAG), a training-free mechanism that modifies attention at inference time in diffusion transformers, with a statistics-based, adaptive layer-selection rule. Unlike prior methods, EMAG produces harder, semantically faithful negatives (fine-grained degradations) that surface difficult failure modes. This enables the denoiser to refine subtle artifacts, improving quality and raising the human preference score (HPS) by +0.46 over CFG. We further demonstrate that EMAG naturally composes with advanced guidance techniques, such as APG and CADS, further improving HPS.
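To make the "contrasting" idea concrete, here is a minimal sketch of the standard guidance extrapolation that CFG uses, together with a generic EMA recurrence. It is an illustration under stated assumptions, not the paper's implementation: the function names `guide` and `ema`, the parameter name `beta`, and the toy numbers are hypothetical, and this excerpt does not specify how EMAG applies the EMA inside attention or how it selects layers.

```python
from typing import List, Sequence


def guide(positive: Sequence[float], negative: Sequence[float], scale: float) -> List[float]:
    """Extrapolate from the negative prediction toward (and past) the positive one.

    With the unconditional prediction as `negative`, this is the usual CFG rule.
    """
    return [n + scale * (p - n) for p, n in zip(positive, negative)]


def ema(values: Sequence[float], beta: float) -> List[float]:
    """Generic EMA recurrence: s_t = beta * s_{t-1} + (1 - beta) * x_t.

    Assumption: shown only to illustrate EMA smoothing; not EMAG's attention rule.
    """
    smoothed: List[float] = []
    state = values[0]  # initialise the running average with the first value
    for x in values:
        state = beta * state + (1.0 - beta) * x
        smoothed.append(state)
    return smoothed


# Toy denoiser outputs (made-up numbers, for illustration only).
eps_cond = [1.0, 2.0, 3.0]
eps_uncond = [0.5, 1.0, 1.5]

# CFG: the unconditional prediction plays the role of the negative sample.
eps_cfg = guide(eps_cond, eps_uncond, scale=2.0)
print(eps_cfg)
```

Negative-sample guidance methods of the kind the abstract describes keep the same extrapolation form but replace the unconditional prediction with the output of a deliberately weakened forward pass. A scale of 1.0 returns the positive prediction unchanged, so the scale controls how strongly the sample is pushed away from the negative.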

Paper Structure

This paper contains 48 sections, 18 equations, 17 figures, 16 tables, 2 algorithms.

Figures (17)

  • Figure 1: CFG (left) vs. EMAG (right). Compared to CFG [ho2022classifier], EMAG produces more semantically plausible images, preserving global structure while sharpening fine details and suppressing minor artifacts. When applied to SD3-Medium, EMAG’s outputs align more closely with human-preference proxies (HPS [wu2023human]).
  • Figure 2: Negative-sample comparison (top: negative samples; bottom: positive samples). Prior guidance methods (SAG [hong2023improving], autoguidance [karras2024guiding], ERG [ifriqi2025entropy], S$^2$-Guidance [chen2025s]) often yield obvious “easy” degradations. In contrast, EMAG produces subtle, semantically near-miss negatives that reveal difficult failure modes yet retain global structure, enabling finer refinement. (Samples from DiT.)
  • Figure 3: CFG (left) vs. EMAG hard negatives (right). EMAG samples at varying $\beta$; larger $\beta$ produces stronger degradations.
  • Figure 4: Qualitative comparison for (a) Unconditional and (b) Text-conditional settings.
  • Figure 5: HPS and FID vs. guidance scale (1000 samples). Colors denote EMAG scales 1.25, 1.5, 1.75, 2.0. (a) HPS ↑. (b) FID ↓. Same split, steps, and evaluation as in the main experiments.
  • ...and 12 more figures