Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

Jason Bohne, Pawel Polak, David Rosenberg, Brian Bloniarz, Gary Kazantsev

TL;DR

Mix- and MoE-DPO address the limitation of single-policy Direct Preference Optimization (DPO) by modeling the policy as a latent mixture over $K$ experts, with a soft context-dependent gating $w_k(x)$ and a corresponding mixture reward $r(x,y) = \log \sum_{k=1}^K w_k(x) \exp(r_k(x,y))$. The framework leverages a variational EM approach on a Mixture-of-Bradley–Terry (MBT) likelihood to learn expert rewards $r_k$, responsibilities $q_k$, and gating weights, while providing closed-form, KL-regularized per-expert policy updates and a principled policy–reward alignment. Two architectural regimes are supported: (i) Mix-DPO with shared encoders and expert-specific heads with fixed weights, and (ii) MoE-DPO with input-dependent gating to route inputs to specialized experts and optionally personalize through user conditioning. Theoretical contributions include decomposition of the MBT ELBO and a per-expert objective that yields $\pi_k^*(y|x) \propto \pi_{\mathrm{ref}(k)}(y|x) \exp\left( r_k(x,y)/\beta \right)$, along with a stable variational training algorithm. Empirically, Mix-DPO improves sentiment and grammar rewards on IMDb-like multi-reward movie reviews and, under MoE-DPO, enables effective cross-domain alignment for movie vs book reviews with gating that supports personalization; these results demonstrate the approach’s scalability, modular deployment, and potential for user-specific, task-aware LLM alignment.
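
The mixture reward and the closed-form per-expert policy stated above can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a finite set of candidate responses, strictly positive gating weights, and plain NumPy arrays standing in for learned models. The helper names `log_sum_exp`, `mixture_reward`, and `expert_policy` are hypothetical.

```python
import numpy as np

def log_sum_exp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mixture_reward(gate_weights, expert_rewards):
    """Mixture reward r(x, y) = log sum_k w_k(x) exp(r_k(x, y)).

    gate_weights:   shape (K,), assumed strictly positive and summing to 1.
    expert_rewards: shape (K, N), rewards of K experts on N candidate responses.
    Returns an array of shape (N,).
    """
    w = np.asarray(gate_weights, dtype=float)
    r = np.asarray(expert_rewards, dtype=float)
    return log_sum_exp(np.log(w)[:, None] + r, axis=0)

def expert_policy(ref_probs, rewards, beta):
    """Per-expert policy pi_k*(y|x) proportional to pi_ref(y|x) exp(r_k(x, y) / beta).

    Normalized over a finite candidate set (an assumption of this sketch).
    """
    logits = np.log(np.asarray(ref_probs, dtype=float)) + np.asarray(rewards, dtype=float) / beta
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

With a uniform reference policy over two candidates, rewards `[0, log 2]`, and `beta = 1`, `expert_policy` reweights the reference in proportion to `[1, 2]`. Larger `beta` keeps the expert closer to its reference policy, which is the role of the KL regularization.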

Abstract

Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.

Paper Structure

This paper contains 29 sections, 7 theorems, 103 equations, 4 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

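Let $(x, y^+, y^-)$ be a preference triplet with $y^+ \succ y^-$, and let $z \in \{1, \ldots, K\}$ be a latent expert index with prior $p(z = k \mid x) = w_k(x)$. Let $\sigma_k(x, y^+, y^-)$ denote the Bradley--Terry likelihood under expert $k$ (the conditional likelihood under expert $k$). Then, for any variational distribution $q$ over the $K$ experts, $\log \sum_{k=1}^K w_k(x) \, \sigma_k(x, y^+, y^-) \geq \sum_{k=1}^K q_k(x, y^+, y^-) \left[ \log w_k(x) + \log \sigma_k(x, y^+, y^-) - \log q_k(x, y^+, y^-) \right]$. The bound is tight when $q_k(x, y^+, y^-) = \frac{w_k(x) \, \sigma_k(x, y^+, y^-)}{\sum_{j=1}^K w_j(x) \, \sigma_j(x, y^+, y^-)}$.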
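
The tightness condition in Theorem 1 can be checked numerically for a single preference triplet. The sketch below is illustrative only and rests on two assumptions not spelled out on this page: the per-expert Bradley-Terry likelihood takes the standard form $\sigma(r_k(x, y^+) - r_k(x, y^-))$, and the ELBO has the standard Jensen form $\sum_k q_k [\log w_k + \log \sigma_k - \log q_k]$. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bt_likelihoods(r_pos, r_neg):
    """Per-expert Bradley-Terry likelihoods (assumed form: sigmoid of the reward gap)."""
    return sigmoid(np.asarray(r_pos, dtype=float) - np.asarray(r_neg, dtype=float))

def responsibilities(w, sigma):
    """Posterior over experts: q_k proportional to w_k * sigma_k."""
    joint = np.asarray(w, dtype=float) * np.asarray(sigma, dtype=float)
    return joint / joint.sum()

def elbo(q, w, sigma):
    """Standard ELBO: sum_k q_k * (log w_k + log sigma_k - log q_k)."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (np.log(w) + np.log(sigma) - np.log(q))))

# Toy example with K = 3 experts (all numbers are made up for illustration).
w = np.array([0.2, 0.5, 0.3])                       # gating weights w_k(x)
sigma = bt_likelihoods([1.0, 0.0, -0.5],            # r_k(x, y+)
                       [0.0, 0.5, 0.5])             # r_k(x, y-)
log_marginal = float(np.log(np.sum(w * sigma)))     # log sum_k w_k sigma_k

q_star = responsibilities(w, sigma)                 # posterior responsibilities
q_uniform = np.full(3, 1.0 / 3.0)                   # an arbitrary alternative q

gap_star = log_marginal - elbo(q_star, w, sigma)        # zero up to rounding
gap_uniform = log_marginal - elbo(q_uniform, w, sigma)  # strictly positive here
```

The gap between the log marginal likelihood and the ELBO equals the KL divergence from `q` to the posterior, so it vanishes at `q_star` and is positive for any other choice such as `q_uniform`. In a variational EM loop, computing `q_star` per triplet would play the role of the E-step.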

Figures (4)

  • Figure 1: Left panel: Average posterior weights for Mix-DPO heads, (a) Head 0, (b) Head 1, and (c) Head 2, indicating specialization. Right panel: t-SNE plot of head-parameter representations, indicating head separation.
  • Figure 2: Sampled words from responses generated by Mix-DPO heads in Case 1: (a) Head 0 (Informativeness and Grammar), (b) Head 1 (Positive Sentiment and Informativeness), and (c) Head 2 (Grammar).
  • Figure 3: Left panel: Average posterior weights for MoE-DPO heads in Case 1, (a) Head 0 and (b) Head 1, indicating specialization. Right panel: Average mixture weights for MoE-DPO, indicating prompt separation.
  • Figure 4: Confusion matrices of frozen and learnable gating layers for movie (0) vs. book (1) prompts, with the difference indicating improvements in predicted labels under joint learning during training.

Theorems & Definitions (16)

  • Theorem 1: ELBO for the MBT Model
  • Corollary 1.1: MBT Variational Loss Function
  • Theorem 2: Equality Decomposition of Reward Mixture
  • Lemma 3: MoE-DPO Objective Decomposition
  • Theorem 4: Policy-Reward Equivalence under Mixture Models
  • Lemma 5
  • proof
  • proof: Proof of Theorem 1 (ELBO for the MBT Model)
  • proof: Proof of Theorem 2 (Equality Decomposition of Reward Mixture)
  • proof: Proof of Lemma 3 (MoE-DPO Objective Decomposition)
  • ...and 6 more