Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

Min Cheng, Fatemeh Doudi, Dileep Kalathil, Mohammad Ghavamzadeh, Panganamala R. Kumar

TL;DR

Diffusion Blend introduces inference-time multi-preference alignment for diffusion models by blending backward diffusion processes corresponding to basis rewards. Two algorithms, DB-MPA (multi-reward) and DB-KLA (KL-regularization control), use a Jensen-gap-based approximation to express the target drift as a linear combination of basis drifts, so that a user can specify the reward $r(w)$ and the regularization strength $\alpha(\lambda)$ at inference time without additional fine-tuning. Experiments on SDv1.5 with multiple rewards show that DB-MPA and DB-KLA outperform baselines and closely approach MORL oracle performance, while offering smooth, real-time control over outputs. The framework reduces computational cost and enables personalized, policy-driven diffusion generation; memory overhead is noted as an area for future efficiency improvements.

Abstract

Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference time. The code is available at https://github.com/bluewoods127/DB-2025.
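The blending idea for DB-MPA can be illustrated with a minimal sketch. It assumes that the drift of each fine-tuned model is available as a function `f(x, t)` and that the blended drift is the weighted sum of those basis drifts, as the summary above describes. The names `blend_drifts` and `euler_step`, the toy drift functions, and the simple Euler discretization are all hypothetical stand-ins for illustration, not the authors' implementation; a real setup would call fine-tuned diffusion networks and a proper sampler.

```python
import numpy as np

def blend_drifts(drifts, weights):
    """Return a drift function equal to the weighted sum of the basis drifts."""
    weights = np.asarray(weights, dtype=float)
    if len(drifts) != len(weights):
        raise ValueError("need exactly one weight per basis drift")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")

    def blended(x, t):
        return sum(w * f(x, t) for w, f in zip(weights, drifts))

    return blended

def euler_step(x, t, dt, drift, noise_scale, rng):
    """One illustrative Euler-Maruyama step of dx = drift dt + noise_scale dW."""
    noise = rng.standard_normal(x.shape)
    return x + drift(x, t) * dt + noise_scale * np.sqrt(dt) * noise

# Toy stand-ins (assumptions) for the drifts of two RL fine-tuned models.
def drift_alignment(x, t):
    return -(x - 1.0)  # pulls samples toward +1

def drift_aesthetic(x, t):
    return -(x + 1.0)  # pulls samples toward -1

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# User-specified preference: 0.8 on the first reward, 0.2 on the second.
drift_w = blend_drifts([drift_alignment, drift_aesthetic], [0.8, 0.2])

# Noise is switched off here so the effect of the blended drift is easy to see.
for step in range(200):
    x = euler_step(x, t=step * 0.05, dt=0.05, drift=drift_w,
                   noise_scale=0.0, rng=rng)
```

In this toy setting the blended drift equals `-(x - 0.6)`, so the samples settle between the two targets at a point determined by the weights. Changing the weights at sampling time moves that point without retraining either basis model, which is the behaviour the abstract attributes to inference-time blending.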

Paper Structure

This paper contains 25 sections, 5 theorems, 29 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Let $f^{(r,\alpha)}$ and $f^{\mathrm{pre}}$ be as specified in Eq. (backward-diffusion-after-ft) and Eq. (backward-diffusion), respectively. Then, $f^{(r,\alpha)}(x_{t}, t) = f^{\mathrm{pre}}(x_{t}, t) + u^{(r,\alpha)}(x_{t}, t)$, where

Figures (11)

  • Figure 1: (a) Overview of our Diffusion Blend-Multi-Preference Alignment (DB-MPA) algorithm. Given two basis reward functions, such as text-image alignment score and aesthetic score, for any preference weight $w = (w_{1}, w_{2})$ specified by the user at inference time, DB-MPA generates images aligned with reward $r(w) = w_{1} r_{1} + w_{2} r_{2}$. (b) During the fine-tuning stage, DB-MPA obtains an RL fine-tuned model for each reward function. (c) At inference time, DB-MPA blends (mixes) the backward diffusion processes of the fine-tuned models according to the user-specified preference $w$.
  • Figure 2: Comparison of the DB-MPA algorithm with relevant baselines: Stable Diffusion v1.5 [rombach2022stablediffusion], CoDe [singh2025code], reward gradient-based guidance (RGG) [chung2023dps], rewarded soup (RS) [rame2023rewarded], and Multi-Objective RL (MORL) [roijers2013survey]. Note that MORL is included only to illustrate the maximum achievable performance by an oracle algorithm. See the related work section for more details about the baselines. For any preference weight $w$ specified by a user at inference time, the goal of these algorithms is to generate images aligned with reward $r(w)=w r_{1} + (1-w) r_{2}$, where $r_{1}$ is the text-to-image alignment score and $r_{2}$ is the aesthetics score. (a) Visual comparison of the images generated for the prompt 'a blue colored apple' for $w \in \{0.2, 0.5, 0.8\}$. (b) Pareto-front comparison of the DB-MPA algorithm with other baselines, evaluated for different values of $w$. The Pareto front of DB-MPA is significantly better than those of all relevant baselines and is close to the empirical upper bound achieved by MORL.
  • Figure 3: (a) Overview of our Diffusion Blend-KL Alignment (DB-KLA) algorithm. Given an RL fine-tuned model for reward $r$ and KL regularization weight $\alpha$, for any regularization modification factor $\lambda$ specified by the user at inference time, the DB-KLA algorithm generates images aligned with $r$ under a KL regularization weight $\alpha/\lambda$. (b) At inference time, DB-KLA blends (mixes) the backward diffusion processes of the fine-tuned model and the pre-trained model according to $\lambda$, which can be larger than $1$. (c) Visual comparisons of the images generated by the DB-KLA algorithm and by $\lambda$-specific RL fine-tuned models for the prompt 'a red apple and a purple backpack'. We use text-to-image alignment as the reward function and consider $\lambda \in \{0.2, 1.0, 2.0\}$. DB-KLA provides smooth control over image generation: increasing $\lambda$ moves the effective model away from the pre-trained model. Even without any additional fine-tuning, the images generated by DB-KLA are similar to those of $\lambda$-specific RL fine-tuned models.
  • Figure 4: Illustration of the smooth control of DB-MPA to generate images aligned with $r(w)$ for any $w \in [0, 1]$. DB-MPA generates images that are better aligned with both rewards, especially for $w \in [0.4, 0.8]$. RS generates images with misinterpreted objects (orange) or missing objects (cellphone).
  • Figure 5: Comparison of the images generated by DB-MPA and baseline algorithms for $w \in \{0.2, 0.5, 0.8\}$.
  • ...and 6 more figures
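
The DB-KLA blending described in the Figure 3 caption can be sketched in a similar way. The sketch assumes the blended drift takes the form $(1-\lambda) f^{\mathrm{pre}} + \lambda f^{(r,\alpha)}$, which is one natural reading of the caption combined with the decomposition $f^{(r,\alpha)} = f^{\mathrm{pre}} + u^{(r,\alpha)}$ in Proposition 1; the exact rule used by the authors is not shown in this summary. The function `kl_blend_drift` and the toy drifts are hypothetical names for illustration only.

```python
import numpy as np

def kl_blend_drift(drift_pre, drift_ft, lam):
    """Drift f_pre + lam * (f_ft - f_pre), i.e. (1 - lam) * f_pre + lam * f_ft.

    lam = 0 recovers the pre-trained drift, lam = 1 the fine-tuned drift,
    and lam > 1 extrapolates beyond the fine-tuned model (assumed form).
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")

    def blended(x, t):
        f_pre = drift_pre(x, t)
        return f_pre + lam * (drift_ft(x, t) - f_pre)

    return blended

# Toy stand-ins (assumptions) for the pre-trained and fine-tuned drifts.
def drift_pre(x, t):
    return -x

def drift_ft(x, t):
    return -(x - 1.0)

x = np.zeros(3)
blended_values = {}
for lam in (0.2, 1.0, 2.0):
    blended_values[lam] = kl_blend_drift(drift_pre, drift_ft, lam)(x, 0.0)
```

With these toy drifts the fine-tuning correction is the constant `1.0`, so the blended drift at `x = 0` scales directly with `lam`. This mirrors the caption's description of moving the effective model further from the pre-trained model as $\lambda$ increases.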

Theorems & Definitions (12)

  • Proposition 1
  • Remark 1
  • Lemma 1
  • Remark 2
  • Lemma 2
  • Proof: Proof of Proposition 1
  • Proposition 2: General statement of Proposition 1
  • Proof
  • Remark 3
  • Lemma 3: Restatement of the approximation-error lemma (lemma:approx-error)
  • ...and 2 more