Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration

Xin Mao; Feng-Lin Li; Huimin Xu; Wei Zhang; Anh Tuan Luu

TL;DR

This work critiques PPO-based RLHF and order-based calibration methods for LLM alignment, highlighting how ignoring reward magnitudes can hamper performance. It introduces Value-based Calibration (VCB), a loss that ties policy-probability gaps directly to normalized reward gaps using a derived formulation that preserves the reward signal while avoiding partition-function elimination. Through a three-step offline training pipeline combining SFT, reward-model training, and pairwise logit-gap calibration, VCB achieves superior alignment on AI assistant and summarization tasks, with strong generalization to out-of-distribution data. The results show robust improvements over RRHF, SLiC, DPO, SFT, and PPO, and emphasize the importance of reward-model quality for maximizing gains, while acknowledging resource constraints and reward-model accuracy as key factors for future work.
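The idea of tying policy-probability gaps to normalized reward gaps can be illustrated with a small sketch. The summary above does not give the exact VCB loss, so everything here is an assumption made for illustration: the function name `value_calibration_loss`, the squared-error form, the use of log-ratios against a reference (SFT) policy, the per-prompt standard-deviation normalization, and the scale `gamma`.

```python
import numpy as np

def value_calibration_loss(logp_policy, logp_ref, rewards, gamma=1.0, eps=1e-8):
    """Hypothetical value-based pairwise loss for one prompt (not the paper's exact form).

    logp_policy, logp_ref: sequence log-probabilities of n candidate responses
        under the trained policy and a fixed reference policy.
    rewards: reward-model scores for the same n responses.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.shape[0]
    if n < 2:
        raise ValueError("need at least two responses to form a pair")

    # Log-ratio of policy to reference for each response.
    logit = logp_policy - logp_ref
    # Assumed normalization: divide rewards by their per-prompt standard deviation.
    norm_r = rewards / (rewards.std() + eps)

    # All unordered pairs (i, j) with i < j.
    i, j = np.triu_indices(n, k=1)
    logit_gap = logit[i] - logit[j]
    reward_gap = norm_r[i] - norm_r[j]

    # Penalize any mismatch between the logit gap and the scaled reward gap,
    # so the size of the reward difference matters, not only its sign.
    return float(np.mean((logit_gap - gamma * reward_gap) ** 2))
```

Under these assumptions, a policy that only preserves the ranking of responses still incurs a loss if its log-ratio gaps are too small or too large relative to the reward gaps, which is the contrast with order-based methods that the summary describes.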

Abstract

While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.

Paper Structure

This paper contains 26 sections, 1 theorem, 35 equations, 6 figures, and 11 tables.

Introduction
Related Work
Unifying RRHF, SLiC and DPO
The Proposed Approach
Value-based Calibration Loss
Derivation
Training Pipeline
Experiments
Tasks and Datasets
Evaluation
Baselines
Implementation Detail
Main Experimental Results
Accuracy of Reward Models
Out-of-distribution Generalization
...and 11 more sections

Key Result

Theorem 1

If $\psi_\pi(y|x) = -\alpha(x)[\log \pi(y|x) + \beta(x,y)]$, where $\alpha(x)$ and $\beta(x,y)$ do not depend on the policy $\pi$ and $\alpha(x)>0$ for all prompts $x$, then the optimal solution of Eq. 2 is $\pi_{\text{opt}}(y|x) = \frac{1}{Z(x)} e^{\frac{r(x,y)}{\alpha(x)}-\beta(x,y)}$, where $Z(x)=\sum_y {e^{\frac{r(x,y)}{\alpha(x)}-\beta(x,y)}}$ is the partition function. A detailed proof is given in Appendix 1.
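The closed form implied by the partition function in Theorem 1, a policy proportional to $e^{r(x,y)/\alpha(x)-\beta(x,y)}$, can be checked numerically on a toy problem. The sketch below assumes that Eq. 2 is the maximization of $\mathbb{E}_{y\sim\pi}[r(x,y)+\psi_\pi(y|x)]$ for a single prompt; that form is not stated in this summary and is an assumption, as are the helper names `objective` and `closed_form_policy` and the toy values.

```python
import numpy as np

def objective(pi, r, alpha, beta):
    """Assumed form of Eq. 2 for one prompt: E_{y~pi}[r(y) + psi(y)],
    with psi(y) = -alpha * (log pi(y) + beta(y))."""
    return float(np.sum(pi * (r - alpha * (np.log(pi) + beta))))

def closed_form_policy(r, alpha, beta):
    """Policy proportional to exp(r / alpha - beta), normalized by Z(x)."""
    logits = r / alpha - beta
    weights = np.exp(logits - logits.max())  # shift for numerical stability
    return weights / weights.sum()

# Toy setup: six candidate responses with arbitrary rewards and offsets.
rng = np.random.default_rng(0)
r = rng.normal(size=6)
beta = rng.normal(size=6)
alpha = 0.7  # must be positive, as the theorem requires

pi_star = closed_form_policy(r, alpha, beta)
best = objective(pi_star, r, alpha, beta)

# Compare against many random policies drawn from the probability simplex.
random_values = [
    objective(rng.dirichlet(np.ones(6)), r, alpha, beta) for _ in range(1000)
]
print(best >= max(random_values))
```

Under the assumed objective, substituting the closed form gives the value $\alpha(x)\log Z(x)$, and because the objective is strictly concave in $\pi$ for $\alpha(x)>0$, no sampled policy should exceed it.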

Figures (6)

Figure 1: Order-based method vs. value-based method.
Figure 2: Illustration of $\Delta^{\pi}_{y_1}$, $\Delta^{\pi}_{y_2}$ and $\Delta^{r}_{y_1,y_2}$.
Figure 3: The training pipeline of the proposed value-based calibration method.
Figure 4: GPT-4 evaluation results: win, tie, and lose ratios of VCB against all baselines.
Figure 5: Win rate of VCB with various $\gamma$ against the preferred response $y_w \in \mathcal{D}_{\text{p}}$.
...and 1 more figure
