Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
TL;DR
This work tackles reward alignment for diffusion- and flow-based generative models by building reward adaptability into the model itself through Diamond Maps: stochastic flow maps that enable efficient inference-time alignment to arbitrary rewards. It introduces two designs. Posterior Diamond Maps distill the posterior $p_{1|t}(\cdot|x_t)$ into a one-step sampler, which allows consistent estimation of the value function $V_t^r(x_t)$ and its gradient. Weighted Diamond Maps convert standard flow maps into stochastic estimators using a local recovery reward and score corrections for fast, scalable guidance. The authors show that Posterior Diamond Maps can sample from the posterior exactly and support Diamond DDPM sampling, while Weighted Diamond Maps offer a plug-in way to reuse existing flow maps with efficient estimators. Experiments on CIFAR-10, CelebA-64, and high-resolution text-to-image tasks demonstrate faster, more robust reward alignment with competitive Pareto-frontier performance against strong baselines. Overall, the work provides a practical route to generative models that can be rapidly aligned to arbitrary preferences and constraints at inference time, without extensive retraining.
Abstract
Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.
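To make the idea of value-function estimation from a one-step stochastic sampler concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the common soft value function $V_t^r(x_t) = \log \mathbb{E}[\exp(r(x_1)) \mid x_t]$, which this text does not define, and it replaces a learned Diamond Map with a hypothetical closed-form Gaussian posterior (`toy_posterior_sampler`) for 1-D data $x_1 \sim N(0, 1)$ under the assumed interpolant $x_t = t x_1 + (1 - t) x_0$. All function names and parameters are illustrative.

```python
import numpy as np


def toy_posterior_sampler(x_t, t, n, rng):
    """Hypothetical stand-in for a one-step sampler of p_{1|t}(. | x_t).

    Assumption: 1-D data x_1 ~ N(0, 1) and x_t = t * x_1 + (1 - t) * x_0
    with x_0 ~ N(0, 1), so the posterior is Gaussian in closed form.
    """
    denom = t**2 + (1.0 - t) ** 2
    mean = t * x_t / denom
    std = (1.0 - t) / np.sqrt(denom)
    return mean + std * rng.standard_normal(n)


def estimate_value(x_t, t, reward, sampler, n, rng):
    """Monte Carlo estimate of log E[exp(reward(x_1)) | x_t] (assumed definition)."""
    r = reward(sampler(x_t, t, n, rng))
    r_max = r.max()  # subtract the max for a numerically stable log-mean-exp
    return r_max + np.log(np.mean(np.exp(r - r_max)))


rng = np.random.default_rng(0)
a, x_t, t = 0.5, 1.0, 0.5  # linear reward r(x) = a * x, chosen for illustration

v_hat = estimate_value(x_t, t, lambda x: a * x, toy_posterior_sampler, 200_000, rng)

# Closed form for a Gaussian posterior N(m, s^2) and linear reward:
# log E[exp(a * x_1)] = a * m + 0.5 * a^2 * s^2
denom = t**2 + (1.0 - t) ** 2
v_exact = a * (t * x_t / denom) + 0.5 * a**2 * (1.0 - t) ** 2 / denom
```

In this toy case the estimate `v_hat` converges to `v_exact` as the number of posterior samples grows, which is the sense of "consistent" used here. With a learned one-step sampler, each posterior sample would cost a single network evaluation in place of a full simulation.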
