Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

Jason Bohne; Pawel Polak; David Rosenberg; Brian Bloniarz; Gary Kazantsev

TL;DR

Mix- and MoE-DPO address the limitation of single-policy Direct Preference Optimization (DPO) by modeling the policy as a latent mixture over $K$ experts, with soft context-dependent gating weights $w_k(x)$ and a corresponding mixture reward $r(x,y) = \log \sum_{k=1}^K w_k(x) \exp(r_k(x,y))$. The framework applies variational EM to a Mixture-of-Bradley–Terry (MBT) likelihood to learn expert rewards $r_k$, responsibilities $q_k$, and gating weights, and it yields closed-form, KL-regularized per-expert policy updates that keep each policy aligned with its reward. Two architectural regimes are supported: (i) Mix-DPO, with a shared encoder, expert-specific heads, and fixed (input-independent) mixture weights; and (ii) MoE-DPO, with input-dependent gating that routes inputs to specialized experts and can optionally personalize through user conditioning. Theoretical contributions include a decomposition of the MBT ELBO and a per-expert objective whose optimum is $\pi_k^*(y|x) \propto \pi_{\mathrm{ref}(k)}(y|x) \exp\left( r_k(x,y)/\beta \right)$, along with a stable variational training algorithm. Empirically, Mix-DPO improves sentiment and grammar rewards on IMDb-like multi-reward movie reviews. MoE-DPO enables cross-domain alignment for movie versus book reviews, with gating that supports personalization. The authors present these results as evidence of the approach's scalability, modular deployment, and potential for user-specific, task-aware LLM alignment.
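The two formulas in the summary above can be made concrete with a minimal numerical sketch. It is an illustration only, not the authors' implementation: the function names `mixture_reward` and `expert_policy` are hypothetical, the response space is assumed to be a small finite candidate set so the policy can be normalized exactly, and all numbers are made up.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def mixture_reward(gate_weights, expert_rewards):
    """r(x, y) = log sum_k w_k(x) exp(r_k(x, y)) for one (x, y) pair.

    gate_weights:   shape (K,), positive, summing to 1
    expert_rewards: shape (K,), the values r_k(x, y)
    """
    a = np.log(gate_weights) + expert_rewards
    m = np.max(a)  # subtract the max for a stable log-sum-exp
    return m + np.log(np.sum(np.exp(a - m)))

def expert_policy(ref_probs, rewards, beta):
    """pi_k^*(y|x) proportional to pi_ref(y|x) exp(r_k(x, y) / beta).

    ref_probs: shape (N,), reference policy over N candidate responses
               (assumption: a finite candidate set, for illustration)
    rewards:   shape (N,), r_k(x, y) for each candidate
    """
    return softmax(np.log(ref_probs) + rewards / beta)

# Illustrative values (assumptions, not taken from the paper).
w = np.array([0.7, 0.3])            # gating weights for K = 2 experts
r_k = np.array([1.0, -0.5])         # expert rewards for one (x, y)
r_mix = mixture_reward(w, r_k)

ref = np.array([0.5, 0.3, 0.2])     # reference policy over 3 candidates
rew = np.array([1.0, 0.0, -1.0])    # one expert's rewards for them
pi_star = expert_policy(ref, rew, beta=0.5)
```

Under these assumptions, a smaller `beta` moves `pi_star` further from the reference toward high-reward candidates, while a very large `beta` leaves it close to `ref`.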

Abstract

Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
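The latent-variable view in the abstract can be illustrated with a small sketch of an E-step for a single preference pair. It assumes the textbook mixture-model posterior, in which each expert scores the pair with a Bradley-Terry likelihood and the responsibilities are proportional to gating weight times likelihood; the paper's exact update may differ. The names `responsibilities` and `elbo`, and all numbers, are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_likelihoods(reward_chosen, reward_rejected):
    """Per-expert Bradley-Terry probability that the chosen response wins."""
    return sigmoid(reward_chosen - reward_rejected)

def responsibilities(gate_weights, reward_chosen, reward_rejected):
    """Posterior q_k proportional to w_k * p_k (assumed standard E-step)."""
    unnorm = gate_weights * bt_likelihoods(reward_chosen, reward_rejected)
    return unnorm / unnorm.sum()

def elbo(q, gate_weights, reward_chosen, reward_rejected):
    """sum_k q_k [log w_k + log p_k - log q_k], a lower bound on
    log sum_k w_k p_k by Jensen's inequality."""
    p = bt_likelihoods(reward_chosen, reward_rejected)
    return np.sum(q * (np.log(gate_weights) + np.log(p) - np.log(q)))

# Illustrative values: expert 0 agrees with the label, expert 1 disagrees.
w = np.array([0.5, 0.5])
r_w = np.array([2.0, 0.0])   # r_k(x, y_chosen)
r_l = np.array([0.0, 2.0])   # r_k(x, y_rejected)

q = responsibilities(w, r_w, r_l)
log_marginal = np.log(np.sum(w * bt_likelihoods(r_w, r_l)))
tight = elbo(q, w, r_w, r_l)                       # bound at the posterior
loose = elbo(np.array([0.5, 0.5]), w, r_w, r_l)    # bound at a uniform q
```

In this sketch the bound is tight when `q` is the exact posterior and strictly looser for any other choice, which is the property an EM-style alternation relies on.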
