Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz

TL;DR

This work tackles reward alignment for diffusion- and flow-based generative models by embedding reward adaptability into the model itself via Diamond Maps, stochastic flow maps that enable efficient, at-inference-time optimization to arbitrary rewards. It introduces two designs: Posterior Diamond Maps, which distill the posterior $p_{1|t}(\cdot|x_t)$ into a one-step sampler and allow consistent estimation of the value function $V_t^r(x_t)$ and its gradient, and Weighted Diamond Maps, which convert standard flow maps into stochastic estimators using a local recovery reward and score corrections for fast, scalable guidance. The authors show that Posterior Diamond Maps can sample from the posterior exactly and support Diamond DDPM sampling, while Weighted Diamond Maps offer a plug-in approach to leverage existing flow maps with efficient estimators. Experiments across CIFAR-10, CelebA-64, and high-resolution text-to-image tasks demonstrate faster, more robust reward alignment with competitive Pareto-frontier performance against strong baselines, highlighting the practical potential for rapid, inference-time adaptation to diverse rewards. Overall, this work provides a practical route to generative models that can be rapidly aligned to arbitrary preferences and constraints at inference time, without extensive retraining.

Abstract

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

Paper Structure

This paper contains 40 sections, 3 theorems, 81 equations, 8 figures, 5 tables, and 3 algorithms.

Key Result

Proposition 4.1

For $\bar{x}_0^k\sim \mathcal{N}(0,I_d)$ and $z^k=X_{0,1}(\bar{x}_0^k|x_t,t)$, the following are consistent estimators of the value function and its gradient:
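As a rough illustration of what such Monte Carlo estimators can look like, the sketch below assumes the common convention $V_t^r(x_t)=\log \mathbb{E}[\exp(r(z))\mid x_t]$ and differentiates through the map by reparameterization. The linear `toy_flow_map`, the quadratic `reward`, and all constants are hypothetical stand-ins chosen so the example runs in one dimension; they are not the trained model or the exact formulas of the proposition.

```python
import math
import random


def toy_flow_map(x0_bar, x_t, t):
    """Hypothetical stand-in for X_{0,1}(x0_bar | x_t, t): a linear toy map."""
    return t * x_t + (1.0 - t) * x0_bar


def reward(z, target=2.0):
    """Hypothetical reward, maximal at z == target."""
    return -0.5 * (z - target) ** 2


def reward_grad(z, target=2.0):
    """Derivative of the toy reward with respect to z."""
    return -(z - target)


def value_and_grad(x_t, t, noises):
    """Monte Carlo estimates of V and dV/dx_t, assuming V = log E[exp(r(z))]."""
    zs = [toy_flow_map(n, x_t, t) for n in noises]
    rs = [reward(z) for z in zs]
    r_max = max(rs)  # subtract the max for a stable log-mean-exp
    weights = [math.exp(r - r_max) for r in rs]
    total = sum(weights)
    value = r_max + math.log(total / len(rs))
    # Softmax-weighted average of reward gradients; dz/dx_t = t for the toy map.
    grad = sum((w / total) * reward_grad(z) * t for w, z in zip(weights, zs))
    return value, grad


rng = random.Random(0)
noises = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # x0_bar^k ~ N(0, 1)
value, grad = value_and_grad(0.5, 0.7, noises)
print(value, grad)
```

With the noise draws held fixed, the gradient estimate is the exact derivative of the value estimate, which is what makes it usable for guidance. A real implementation would obtain the Jacobian of the map by automatic differentiation rather than by hand.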

Figures (8)

  • Figure 1: Overview of Diamond Maps. Diamond Maps are stochastic flow maps that make it possible to perform one-step "look-aheads" from a flow trajectory (blue) to potential endpoints at time $1$ in order to evaluate a reward $r$. This enables efficient exploration, search, and guidance. We propose two Diamond Map designs. Posterior Diamond Maps (left) distill GLASS Flows into a flow map $X_{s,r}(\bar{x}|x_t,t)$ designed to draw exact samples from the posterior. Weighted Diamond Maps (middle) allow the use of standard flow maps by making them stochastic via a simple renoising procedure with. ESS: effective sample size. Right: Improved image alignment with Diamond Maps. Prompts: "A diamond, a folded treasure map, a compass, and a dagger". "A laptop on top of a teddy bear".
  • Figure 2: The effective time $0\to r^*$ amortized in the flow map is significantly smaller for Diamond Early Stop DDPM sampling ("Early stop") than for iterative denoising and noising ("Renoise"), leading to reduced error accumulation (see \ref{fig:diamond_maps_sampling}). Here, we plot $r^*$ as a function of $t$ and $t'$ (see \ref{appendix:r_star_formula}).
  • Figure 3: Illustration of \ref{prop:weighted_diamond_flow_estimator} with a blueness reward (i.e., the reward is maximized for a fully blue image). Left: Sampling from the SANA-Sprint model with no reward. Middle: Naive re-noising and reward gradient with no weighting (\ref{eq:renoise}). The entire image turns blue, close to collapse. Right: Corrected value function gradient via \ref{prop:weighted_diamond_flow_estimator}: the image remains more realistic because the added regularization prevents drift off the data manifold.
  • Figure 4: Training and Sampling Posterior Diamond Maps. Left: Example of one-step posterior samples from Posterior Diamond Maps. The one-step samplers are faithful and of high quality. Middle: Quantitative results for posterior sampling at various times $t$. Posterior Diamond Maps outperform GLASS Flows and have therefore successfully distilled them. Top right: In contrast to flow maps, which always return the same sample, Posterior Diamond Maps are stochastic and allow exploring future possibilities (sampling from the posterior). Bottom right: Iterative sampling via an iterative denoising and noising scheme (\ref{subsec:sampling_posterior_diamond_maps}) leads to error accumulation, while sample quality improves with Diamond Early Stop DDPM sampling.
  • Figure 5: Illustration of guidance with Weighted Diamond Maps. Trajectories of guidance (see \ref{alg:guidance_with_weighted_diamond_maps}) are plotted after $N$ steps ($x_1$-prediction). Top: The prompt is "a matte red mech suit under dramatic key lights, a sleek electric blue drone hovering nearby, and a cylindrical maintenance pod with glowing panels". Note: The drone only appears after 4 guidance steps. Middle: "A toaster riding a bike". Guidance removes artifacts and increases adherence to the prompt.
  • ...and 3 more figures
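
The caption of Figure 1 refers to the effective sample size (ESS) of weighted samples. A minimal sketch of the standard ESS formula, $(\sum_k w_k)^2 / \sum_k w_k^2$, computed from log-weights, is given below. The function name and the example log-weights are hypothetical; how the paper actually forms its weights is not shown in this summary.

```python
import math


def effective_sample_size(log_weights):
    """Standard ESS, (sum w)^2 / sum w^2, computed stably from log-weights."""
    m = max(log_weights)  # shift so the largest weight becomes exactly 1
    w = [math.exp(lw - m) for lw in log_weights]
    s1 = sum(w)
    s2 = sum(x * x for x in w)
    return s1 * s1 / s2


# Equal weights: every sample counts fully.
print(effective_sample_size([0.0] * 8))
# One dominant weight: the ESS collapses toward 1.
print(effective_sample_size([0.0, -1000.0, -1000.0]))
```

An ESS close to the number of samples indicates well-balanced weights, whereas an ESS near 1 means a single sample dominates the estimate.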

Theorems & Definitions (8)

  • Proposition 4.1
  • Remark 4.2: Training from scratch
  • Proposition 4.3: Diamond DDPM sampling
  • Proposition 5.1: Weighted Diamond Map
  • proof
  • proof
  • proof: Proof of \ref{prop:ddpm_transition_via_posterior_flow_map}
  • proof