Inference-Time Alignment of Diffusion Models via Evolutionary Algorithms

Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu

TL;DR

This work addresses aligning diffusion-model outputs to downstream objectives without access to gradients or internal states by proposing inference-time, black-box optimization over latent noise using evolutionary algorithms. It introduces two latent-space formulations (direct noise search and noise-transform search) and two representative EA families (Genetic Algorithms and Natural Evolutionary Strategies), evaluated across DrawBench and Open Image Preferences with multiple reward functions. Results show that evolutionary approaches, particularly CoSyNE and SNES, often outperform gradient-based and gradient-free baselines in short-horizon settings while offering substantial memory and speed advantages, and they remain compatible with fine-tuning alignment methods. The findings highlight a scalable, model-agnostic toolkit for practical diffusion-model alignment, with caveats around long-horizon optimization and potential reward hacking, motivating future EA-tailored designs for extended horizons.

Abstract

Diffusion models are state-of-the-art generative models, yet their samples often fail to satisfy application objectives such as safety constraints or domain-specific validity. Existing alignment techniques require gradients, internal model access, or large computational budgets, resulting in high compute demands or a lack of support for certain objectives. In response, we introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black boxes and search their latent space to maximize alignment objectives. Given equal or less running time, our method achieves 3-35% higher ImageReward scores than gradient-free and gradient-based methods. On the Open Image Preferences dataset, our method achieves competitive results across four popular alignment objectives. In terms of computational efficiency, our method requires 55% to 76% less GPU memory and is 72% to 80% faster than gradient-based methods.
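The black-box latent search described in the abstract can be illustrated with a minimal sketch. It assumes a simplified SNES-style update (rank-based utilities, per-dimension step sizes) and is not the authors' implementation. `generate`, `reward`, `snes_noise_search`, and all hyperparameters are hypothetical stand-ins: in practice `generate` would be a full diffusion sampling run and `reward` an alignment objective such as ImageReward.

```python
import numpy as np

def generate(z):
    """Hypothetical stand-in for a black-box diffusion sampler (noise -> sample)."""
    return np.tanh(z)

def reward(x):
    """Hypothetical stand-in for an alignment objective; higher is better."""
    return -float(np.mean((x - 0.5) ** 2))

def rank_utilities(n):
    """Rank-based utilities as commonly used with NES; index 0 is the best sample."""
    ranks = np.arange(1, n + 1)
    u = np.maximum(0.0, np.log(n / 2 + 1) - np.log(ranks))
    return u / u.sum() - 1.0 / n

def snes_noise_search(z_init, steps=14, pop=16, sigma0=0.5, seed=0):
    """Search over the initial noise using only reward evaluations (no gradients)."""
    rng = np.random.default_rng(seed)
    d = z_init.size
    mu, sigma = z_init.copy(), np.full(d, sigma0)
    eta_sigma = (3 + np.log(d)) / (5 * np.sqrt(d))  # assumed default learning rate
    util = rank_utilities(pop)
    best_z, best_r = z_init.copy(), reward(generate(z_init))
    for _ in range(steps):
        eps = rng.standard_normal((pop, d))      # standard-normal perturbations
        cand = mu + sigma * eps                  # candidate noise vectors
        scores = np.array([reward(generate(c)) for c in cand])
        order = np.argsort(-scores)              # best candidate first
        if scores[order[0]] > best_r:            # keep the best sample seen so far
            best_z, best_r = cand[order[0]].copy(), float(scores[order[0]])
        eps_sorted = eps[order]
        mu = mu + sigma * (util @ eps_sorted)    # move the search mean
        sigma = sigma * np.exp(0.5 * eta_sigma * (util @ (eps_sorted ** 2 - 1)))
    return best_z, best_r

z0 = np.random.default_rng(1).standard_normal(16)
best_z, best_r = snes_noise_search(z0)
print(reward(generate(z0)), best_r)
```

The loop only calls `generate` and `reward`, so nothing is backpropagated through the sampler. This is the property that lets the approach treat the model as a black box.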

Paper Structure

This paper contains 45 sections, 7 equations, 17 figures, 13 tables, 2 algorithms.

Figures (17)

  • Figure 1: Samples generated by our method on Stable Diffusion-3. Each row shows the progression over optimization steps $i$, with the corresponding metric values displayed in the top-right corner. The noise in each generation is optimized via our evolutionary approach, which uses the CoSyNE algorithm [gomez2008accelerated] over 14 optimization steps. Arrows ($\uparrow$/$\downarrow$) indicate whether the metric is being maximized or minimized.
  • Figure 2: Mapping between Eq. \ref{eq:evo_general} and Alg. \ref{alg:noise_search}, which searches over $z_T$ directly or over an affine transform of $z_T$. Connections between Eq. \ref{eq:evo_general} and our method are color-coded. We perform alignment on human preferences (HPSv2, ImageReward), JPEG size, and CLIP scores.
  • Figure 3: We depict how GA and ES search the latent noise space (a Gaussian hypersphere [gaussian_annulus_lecture_2017]) over optimization steps $t$. GAs maintain an empirical population of solutions $q_\phi=\{\psi_0, \ldots, \psi_N\}$, while ES maintains a distribution $q_\phi=\mathcal{N}(\mu_i,\sigma_i)$ from which we sample.
  • Figure 4: Reward Standard Deviation per Step (Diversity). CoSyNE (GA) has rapidly diminishing diversity, while SNES (ES) maintains diversity over optimization steps, as its search distribution evolves (Sec. \ref{sec:method:evo_genetic}). This suggests that GAs are suited for short-term optimization, and ES for long-term optimization. N.B. We use standard deviation as a proxy for solution diversity.
  • Figure 5: Computational characteristics of inference-time alignment methods on Stable Diffusion-1.5. In terms of memory usage, our evolutionary methods scale with batch size, whereas DNO does not and exhausts memory for batch sizes $\geq16$. Our methods exhibit near-constant or decreasing per-step latency as batch size increases, indicating that they can be efficiently batched. For DNO, $P=B$.
  • ...and 12 more figures
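
The population view of a GA (Figure 3) and the reward-standard-deviation diversity proxy (Figure 4) can be sketched together. The sketch below is a generic genetic algorithm with truncation selection, uniform crossover, Gaussian mutation, and elitism. It is an assumption for illustration and is not CoSyNE. `toy_reward`, `ga_step`, `run`, and all hyperparameters are hypothetical.

```python
import numpy as np

def toy_reward(z):
    """Hypothetical reward over latent noise; higher is better."""
    return -float(np.mean((np.tanh(z) - 0.5) ** 2))

def ga_step(pop, scores, rng, elite_frac=0.25, mut_std=0.1):
    """One generic GA generation over a population of noise vectors."""
    n, d = pop.shape
    n_elite = max(2, int(n * elite_frac))
    elites = pop[np.argsort(-scores)[:n_elite]]        # truncation selection
    pa = elites[rng.integers(n_elite, size=n)]         # first parents
    pb = elites[rng.integers(n_elite, size=n)]         # second parents
    mask = rng.random((n, d)) < 0.5
    children = np.where(mask, pa, pb)                  # uniform crossover
    children = children + mut_std * rng.standard_normal((n, d))  # mutation
    children[0] = elites[0]                            # elitism: keep the best unchanged
    return children

def run(steps=10, n=16, d=64, seed=0):
    """Track the best reward and the reward standard deviation per step."""
    rng = np.random.default_rng(seed)
    pop = rng.standard_normal((n, d))                  # initial Gaussian noise population
    diversity, best = [], []
    for _ in range(steps):
        scores = np.array([toy_reward(z) for z in pop])
        diversity.append(float(scores.std()))          # diversity proxy, as in Figure 4
        best.append(float(scores.max()))
        pop = ga_step(pop, scores, rng)
    return diversity, best

diversity, best = run()
for step, (dv, b) in enumerate(zip(diversity, best)):
    print(step, round(dv, 4), round(b, 4))
```

Because the best individual is copied unchanged into the next generation, the best reward never decreases from one step to the next. How quickly the reward standard deviation shrinks depends on the selection pressure and mutation scale chosen here, so this toy should not be read as reproducing the paper's curves.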