Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment

Shengyang Sun, Yian Zhang, Alexander Bukharin, David Mosallanezhad, Jiaqi Zeng, Soumye Singhal, Gerald Shen, Adithya Renduchintala, Tugrul Konuk, Yi Dong, Zhilin Wang, Dmitry Chichkov, Olivier Delalleau, Oleksii Kuchaiev

TL;DR

This work presents Reward-aware Preference Optimization (RPO), a unifying mathematical framework that connects offline and online preference optimization methods (DPO, IPO, SimPO, RLHF variants) by varying distance metrics, reward models, and data collection choices. Through a synthetic Ground-Truth judge setup, it enables clean ablations to identify which design factors most influence alignment, and introduces online RPO-bwd as a competitive variant with improved stability. The study demonstrates that RPO subsumes many existing algorithms, shows when online vs offline approaches excel, and highlights the critical role of reward-model quality in online settings. It also provides practical alignment recommendations and a roadmap for future work, including token-level extensions and broader evaluation regimes.

Abstract

The rapid development of large language model (LLM) alignment algorithms has resulted in a complex and fragmented landscape, with limited clarity on the effectiveness of different methods and their inter-connections. This paper introduces Reward-Aware Preference Optimization (RPO), a mathematical framework that unifies popular preference optimization techniques in LLM alignment, including DPO, IPO, SimPO, and REINFORCE (LOO), among others. RPO provides a structured approach to disentangle and systematically study the impact of various design choices, such as the optimization objective, the number of responses per prompt, and the use of implicit versus explicit reward models, on LLM preference optimization. We additionally propose a new experimental setup that enables the clean and direct ablation of such design choices. Through an extensive series of ablation studies within the RPO framework, we gain insights into the critical factors shaping model alignment, offering practical guidance on the most effective strategies for improving LLM alignment.

Paper Structure

This paper contains 40 sections, 1 theorem, 26 equations, 4 figures, 4 tables, 4 algorithms.

Key Result

Theorem 3.1

When the Bernoulli KL divergence is used as the distance metric in Reward-aware Preference Optimization with $\beta=1$ and $\eta=1$, the objective is equivalent to Equation 22 in Pandey et al. (2024).
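
To make the quantity in the theorem concrete, the following is a minimal sketch of a Bernoulli KL distance between an implicit and an explicit reward gap. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mapping of reward gaps to Bernoulli parameters through a sigmoid, and the direction of the KL (target first, model second) are all assumptions made here, since this summary does not spell them out.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def rpo_bernoulli_distance(
    logratio_1: float,
    logratio_2: float,
    reward_1: float,
    reward_2: float,
    beta: float = 1.0,
    eta: float = 1.0,
) -> float:
    """Hypothetical per-pair distance for two responses to one prompt.

    logratio_i is assumed to be log pi(y_i | x) - log pi_ref(y_i | x);
    reward_i is assumed to be an explicit reward-model score for y_i.
    """
    implicit_gap = beta * (logratio_1 - logratio_2)  # policy's implicit reward gap
    explicit_gap = eta * (reward_1 - reward_2)       # reward model's gap
    p_target = sigmoid(explicit_gap)  # preference probability implied by the RM
    p_model = sigmoid(implicit_gap)   # preference probability implied by the policy
    # Assumed direction: KL(target || model); the reverse direction is also possible.
    return bernoulli_kl(p_target, p_model)
```

With `beta=1` and `eta=1`, as in the theorem, the distance is zero exactly when the policy's log-ratio gap matches the reward gap, and it grows as the two preference probabilities diverge.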

Figures (4)

  • Figure 1: Average reward (left) and win rate (middle) on lmsys (valid) prompts over the course of training. The right panel shows MT-Bench scores (judged by Mistral Large 2). Error bars represent 95% confidence intervals over 3 independent runs. We compare two training datasets, generated by the llama3-8b-sft model from lmsys prompts and from synthetic prompts, respectively. Training on in-distribution lmsys prompts achieves higher rewards than training on out-of-distribution synthetic prompts. However, the MT-Bench metric has large variance and shows hardly any learning signal.
  • Figure 2: Online RPO-bwd vs. online RPO-sqloo (RLOO). We plot the average reward on lmsys (valid) (left) and the KL divergence from the reference policy (right). With RPO-bwd, the validation reward increases faster and the KL divergence increases more slowly, which indicates that online RPO-bwd optimizes the RLHF objective (Eq. \ref{eq:rlhf-objective}) better than RLOO. In addition, RLOO's training diverged midway, whereas RPO-bwd's training remained stable in all our runs.
  • Figure 3: Performance improves consistently with more iterations.
  • Figure 4: Rewards from the ground-truth (GT) RM vs. rewards from the learned RM over the course of training.
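
For readers unfamiliar with the RLOO baseline referenced in Figure 2, the sketch below shows the standard leave-one-out advantage computation used in REINFORCE leave-one-out: each sampled response is scored against the mean reward of the other responses to the same prompt. The function name is hypothetical, and this is only a sketch of the general technique, not the paper's training code.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each response relative to the mean reward of the others.

    `rewards` holds the scores of K >= 2 responses sampled for one prompt.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two responses per prompt")
    total = sum(rewards)
    # Baseline for response i is the average of the remaining K - 1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]


# Three responses to one prompt with rewards 1, 2, and 3.
advantages = leave_one_out_advantages([1.0, 2.0, 3.0])
```

The advantages always sum to zero across the responses for a prompt, so the baseline reduces variance without needing a learned value function.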

Theorems & Definitions (2)

  • Theorem 3.1
  • proof