UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function

Zhichao Wang, Bin Bi, Can Huang, Shiva Kumar Pentyala, Zixu James Zhu, Sitaram Asur, Na Claire Cheng

TL;DR

UNA introduces a generalized implicit reward function to unify RLHF/PPO, DPO, and KTO by reframing alignment as minimizing the discrepancy between an implicit reward and explicit feedback. The core result is $r_\theta(x,y)=\beta \log\left(\frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}\right)$ (up to a prompt-dependent term $f(x)$ and constants), enabling online and offline learning across pairwise, binary, and score-based data. Offline UNA matches DPO on pairwise data and beats KTO on binary/score-based feedback, while Online UNA replaces PPO with a stable supervised-like update, delivering improved performance and reduced training time and memory. Overall, UNA provides a unified framework for LLM alignment that accepts diverse feedback signals while improving stability and efficiency.
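The implicit reward and the discrepancy it is matched against can be illustrated with a minimal sketch. The function names, the log-probability values, the explicit rewards, and the choice of $\beta = 0.5$ below are all assumptions made for illustration, and the $f(x)$ term and constants are set to zero; this is not the authors' implementation.

```python
def implicit_reward(logp_policy, logp_ref, beta):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed from sequence log-probabilities.

    Assumes the prompt-dependent term f(x) and any constants are zero.
    """
    return beta * (logp_policy - logp_ref)


def mse_alignment_loss(implicit, explicit):
    """Mean squared difference between implicit and explicit rewards."""
    return sum((ri - re) ** 2 for ri, re in zip(implicit, explicit)) / len(implicit)


# Hypothetical sequence log-probabilities for three responses (illustrative values).
logp_policy = [-12.0, -15.0, -9.0]
logp_ref = [-13.0, -14.0, -9.0]
# Hypothetical explicit rewards, e.g. from a reward model or human scores.
explicit = [0.5, -0.2, 0.0]

implicit = [implicit_reward(p, r, beta=0.5) for p, r in zip(logp_policy, logp_ref)]
loss = mse_alignment_loss(implicit, explicit)
print(implicit, loss)
```

In an actual training loop the log-probabilities would come from the policy and the frozen reference model, and the loss would be differentiated with respect to the policy parameters only.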

Abstract

An LLM is pretrained on trillions of tokens, but the pretrained LLM may still generate undesired responses. To solve this problem, alignment techniques such as RLHF, DPO and KTO have been proposed. However, these alignment techniques have limitations. For example, RLHF requires training the reward model and the policy separately, which is complex, time-consuming, memory-intensive and unstable during training. DPO proposes a mapping between an optimal policy and a reward, greatly simplifying the training process of RLHF. However, it cannot take full advantage of a reward model and is limited to pairwise preference data. In this paper, we propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO and KTO. First, we mathematically prove that, given the classical RLHF objective, the optimal policy is induced by a generalized implicit reward function. With this novel mapping between a reward model and an optimal policy, UNA can (1) unify RLHF/PPO, DPO and KTO into supervised learning that minimizes the difference between an implicit reward and an explicit reward; (2) outperform RLHF/PPO while simplifying, stabilizing, speeding up and reducing the memory burden of the RL fine-tuning process; and (3) accommodate different feedback types, including pairwise, binary and scalar feedback. Downstream experiments show UNA outperforms DPO, KTO and RLHF.

Paper Structure (22 sections, 25 equations, 2 figures, 6 tables)

This paper contains 22 sections, 25 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: A comparison of (a) UNA, (b) RLHF, (c) DPO and (d) KTO. Each subfigure is composed of four types of data ("prompt data", "preference feedback", "binary feedback" and "score feedback"), the "LLM policy", the "response", two reward models (the "generalized implicit reward model" and the "explicit reward model") and a module that minimizes the difference between implicit and explicit rewards. Connections from data to other modules are drawn as green dashed arrows, while all other connections are black solid arrows. All unused modules are grayed out. In part (b), RLHF first uses preference feedback to train the explicit reward model, and then uses the evaluations provided by the explicit reward model to continuously optimize the policy in an online mode. In comparison, in parts (c) and (d), DPO and KTO use preference feedback and binary feedback, respectively, to generate an implicit reward that aligns the LLM policy. In part (a), however, UNA can use different types of data to obtain generalized implicit and explicit rewards and minimize their difference to align the LLM policy in both online and offline modes.
  • Figure 2: The two applications of UNA: Offline UNA and Online UNA. Offline UNA includes (a) equivalence to DPO for pairwise data, (b) an improvement over KTO for binary data and (c) RM/LLM distillation for score-based data. Online UNA includes (d) a simplification of RLHF for online training. The same modules are used as in Figure 1, and unused modules are grayed out. For part (a), the same steps as DPO are used. For parts (b), (c) and (d), implicit and explicit rewards are first gathered from the different types of data, including pairwise, binary and score-based feedback. Then the difference between the implicit and explicit rewards is minimized, for example with an MSE loss, to align the LLM policy. More details can be found in the section on UNA details.
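
The offline cases in Figure 2 can be sketched as two small loss functions over implicit rewards. This is a hedged illustration, not the paper's code: the function names are hypothetical, the pairwise case uses the standard DPO log-sigmoid form, and mapping binary feedback to targets of 1.0 and 0.0 with a squared-error loss is an assumption made here for simplicity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_loss(r_chosen, r_rejected):
    """DPO-style loss on implicit rewards: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def pointwise_loss(r_implicit, target):
    """Squared error between an implicit reward and an explicit target."""
    return (r_implicit - target) ** 2


def binary_target(thumbs_up):
    """Assumed mapping of binary feedback to an explicit reward (1.0 or 0.0)."""
    return 1.0 if thumbs_up else 0.0


# (a) Pairwise data: only the reward difference matters, so a shared
# prompt-dependent shift added to both rewards leaves the loss unchanged.
base = pairwise_loss(0.8, 0.3)
shifted = pairwise_loss(0.8 + 5.0, 0.3 + 5.0)

# (b) Binary data: match the implicit reward to the assumed 0/1 target.
binary = pointwise_loss(0.5, binary_target(True))

# (c) Score-based data: match the implicit reward to a hypothetical score,
# e.g. one produced by a reward model or an LLM judge.
scored = pointwise_loss(0.5, 0.9)
print(base, shifted, binary, scored)
```

The pairwise case shows why a prompt-dependent term such as $f(x)$ does not affect preference-based training, whereas the pointwise cases compare the implicit reward directly against an explicit value.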