MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

TL;DR

MaPPO introduces a principled Maximum-a-Posteriori framework for preference optimization by integrating a prior reward knowledge term into the Direct Preference Optimization objective. By incorporating a reward gap $\Delta_r$ into the loss, MaPPO stabilizes updates, mitigates the squeezing effect, and yields better-calibrated policies without adding hyperparameters. The approach is compatible with offline and online settings and serves as a plugin to popular DPO variants such as SimPO, IPO, and CPO, consistently improving alignment across multiple model families and benchmarks. Empirical results demonstrate robust gains on AlpacaEval 2.0, Arena-Hard, and MT-Bench, while maintaining efficiency and scalability, highlighting MaPPO’s practical value for safer, more reliable LLM alignment.

Abstract

As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin that consistently improves DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
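
To make the contrast between the MLE view and a prior-aware view concrete, here is a minimal Python sketch. The function `dpo_loss` is the standard per-pair DPO loss. The function `gap_weighted_loss` is only an illustration of what "incorporating a reward gap $\Delta_r$" and "re-weighting the rejected response" could look like: this page does not state the exact MaPPO objective, so the weighting form, the function names, and the toy log-probabilities are all assumptions. The reward values 0.95 and 0.91 are those of the Figure 1 example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard MLE-based DPO loss for a single (x, y_w, y_l) pair."""
    chosen = beta * (logp_w - ref_logp_w)    # implicit reward of y_w
    rejected = beta * (logp_l - ref_logp_l)  # implicit reward of y_l
    return -log_sigmoid(chosen - rejected)


def gap_weighted_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      r_w, r_l, beta=0.1):
    """Hypothetical prior-aware variant (ASSUMPTION, not confirmed by the
    source): the rejected term is scaled by the reward gap r_w - r_l."""
    delta_r = r_w - r_l  # prior reward gap, with rewards in [0, 1]
    chosen = beta * (logp_w - ref_logp_w)
    rejected = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(chosen - delta_r * rejected)


# Toy scenario (made-up numbers): the policy has lowered the log-likelihood
# of BOTH responses, but lowered the rejected one more.
pair = dict(logp_w=-10.5, logp_l=-14.0, ref_logp_w=-10.0, ref_logp_l=-12.0)

plain = dpo_loss(**pair)
weighted = gap_weighted_loss(**pair, r_w=0.95, r_l=0.91)
print(f"DPO loss: {plain:.4f}  gap-weighted loss: {weighted:.4f}")
```

In this sketch, a reward gap of 1 recovers the DPO loss exactly, while a small gap (0.04 for the Figure 1 pair) shrinks the contribution of the rejected response, so pushing down both responses no longer lowers the loss the way it does under plain DPO. No extra tunable hyperparameter appears, since the weight comes from the prior rewards themselves.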

Paper Structure

This paper contains 51 sections, 89 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: An example of $(\mathbf{x}, \mathbf{y}_{w}, \mathbf{y}_{l})$ pair. Both responses $\mathbf{y}_{w}$ and $\mathbf{y}_{l}$ have good quality as they achieve high rewards, where $r(\mathbf{x}, \mathbf{y}_{w}) = 0.95$, $r(\mathbf{x}, \mathbf{y}_{l}) = 0.91$, and $r \in [0,1]$.
  • Figure 2: Under the standard MLE-based DPO (left), empirical studies [pal2024smaug, rafailov2024r, tajwar2024preference, ren2024learning] demonstrated that training tends to simultaneously downscale (with different magnitudes) both the chosen and rejected responses to increase their gap. Our MaP-based method (right) mitigates this harmful tendency by re-weighting the rejected response based on prior knowledge. Here, the x-axis denotes the initial model $\theta_0$ and a potentially harmful model $\theta_k$ that may arise during training, while the y-axis shows the log-likelihood of a fixed preference pair under different policies.
  • Figure 3: Illustration of the iterative MaPPO pipeline in each iteration $k$.
  • Figure 4: Before MLE optimization, the model consistently generates high-quality (high rewards) answers $\mathbf{y}_{w}$ and $\mathbf{y}_{l}$ with prompt $\mathbf{x}$.
  • Figure 5: After MLE optimization, the model degenerates, and the outputs $\mathbf{y}_{w}$ and $\mathbf{y}_{l}$ become verbose (low rewards) with prompt $\mathbf{x}$.
  • ...and 1 more figure

Theorems & Definitions (1)

  • proof