Diffusion Alignment as Variational Expectation-Maximization
Jaewoo Lee, Minsu Kim, Sanghyeok Choi, Inhyuck Song, Sujin Yun, Hyeongyu Kang, Woocheol Shin, Taeyoung Yun, Kiyoung Om, Jinkyoo Park
TL;DR
This work introduces Diffusion Alignment as Variational EM (DAV), a principled framework that alternates between an E-step of test-time search for reward-aligned, diverse trajectories and an M-step that distills these trajectories into the diffusion model via forward KL minimization. By modeling alignment as a variational inference problem with a discount factor, DAV achieves multi-modal reward alignment without succumbing to mode collapse or reward over-optimization. It is demonstrated on both continuous diffusion for text-to-image synthesis and discrete diffusion for DNA sequence design, showing improved reward metrics while preserving alignment, naturalness, and diversity. The approach is modular, extends to non-differentiable rewards, and offers a general pathway for robust downstream optimization of diffusion models in diverse domains.
Abstract
Diffusion alignment aims to optimize diffusion models for a downstream objective. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity for both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.
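The E-step/M-step alternation described above can be illustrated with a toy sketch. This is not the authors' implementation: the diffusion model is replaced by a plain categorical distribution over eight items, test-time search is replaced by importance resampling with weights exp(reward / temperature), and the discount factor mentioned in the TL;DR is omitted. All names (`e_step`, `m_step`, `toy_dav`, `two_mode_reward`) and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def e_step(probs, reward, n_samples, temperature, rng):
    """Assumed stand-in for test-time search: draw candidates from the current
    model, then resample them in proportion to exp(reward / temperature)."""
    items = list(range(len(probs)))
    candidates = rng.choices(items, weights=probs, k=n_samples)
    weights = [math.exp(reward(x) / temperature) for x in candidates]
    return rng.choices(candidates, weights=weights, k=n_samples)


def m_step(samples, n_items, smoothing=1e-3):
    """Forward-KL (maximum-likelihood) fit of a categorical model to the
    E-step samples; for this model it reduces to smoothed frequencies."""
    counts = [smoothing] * n_items
    for x in samples:
        counts[x] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def toy_dav(reward, n_items, n_rounds=5, n_samples=2000, temperature=0.5, seed=0):
    """Alternate E-step and M-step, starting from a uniform model."""
    rng = random.Random(seed)
    probs = [1.0 / n_items] * n_items
    for _ in range(n_rounds):
        samples = e_step(probs, reward, n_samples, temperature, rng)
        probs = m_step(samples, n_items)
    return probs


def two_mode_reward(x):
    """Hypothetical reward with two equally good modes (items 1 and 6)."""
    return 1.0 if x in (1, 6) else 0.0


probs = toy_dav(two_mode_reward, n_items=8)
print([round(p, 3) for p in probs])
```

In this toy setting the fitted distribution concentrates on both rewarded items rather than on one of them, which is the qualitative behaviour the abstract attributes to distilling diverse E-step samples via forward KL. The sketch says nothing about how DAV behaves on real diffusion models.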
