Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration
Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Anh Tuan Luu
TL;DR
This work critiques PPO-based RLHF for its complexity and instability, and order-based calibration methods for ignoring reward magnitudes, which can hamper performance. It introduces Value-based Calibration (VCB), a loss that ties policy-probability gaps directly to normalized reward gaps using a derived formulation that preserves the reward signal while avoiding partition-function elimination. VCB is trained with a three-step offline pipeline (SFT, reward-model training, then pairwise logit-gap calibration) and achieves stronger alignment than the compared methods on AI assistant and summarization tasks, with strong generalization to out-of-distribution data. The results show robust improvements over RRHF, SLiC, DPO, SFT, and PPO, and indicate that reward-model quality matters for maximizing gains. The authors acknowledge resource constraints and reward-model accuracy as key factors for future work.
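To make the idea of tying policy-probability gaps to normalized reward gaps concrete, here is a minimal sketch of a pairwise calibration loss for the responses to a single prompt. It is an illustration under stated assumptions, not the paper's exact objective: the squared-error form, the use of the log-ratio against the SFT model as the "logit", normalization by the standard deviation of the rewards, and the names `calibration_loss`, `lam`, and `eps` are all assumptions made for this example.

```python
import statistics


def calibration_loss(policy_logps, ref_logps, rewards, lam=1.0, eps=1e-8):
    """Assumed squared-error calibration loss over all response pairs.

    policy_logps[i]: log-probability of response i under the policy being trained
    ref_logps[i]:    log-probability of response i under the frozen SFT model
    rewards[i]:      reward-model score of response i
    lam:             hypothetical scaling hyperparameter (an assumption)
    eps:             small constant to avoid division by zero (an assumption)
    """
    n = len(rewards)
    if n < 2 or len(policy_logps) != n or len(ref_logps) != n:
        raise ValueError("need at least two responses and equal-length inputs")

    # Normalize reward gaps by the spread of rewards for this prompt (assumed).
    sigma = statistics.pstdev(rewards) + eps

    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            # Gap between the two responses' log-ratios to the SFT model.
            logit_gap = (policy_logps[i] - ref_logps[i]) - (policy_logps[j] - ref_logps[j])
            # Normalized reward gap that the logit gap is asked to match.
            reward_gap = (rewards[i] - rewards[j]) / (lam * sigma)
            total += (logit_gap - reward_gap) ** 2
            pairs += 1
    return total / pairs
```

Unlike an order-based loss, which only checks which response of a pair is ranked higher, this sketch penalizes any mismatch between the size of the logit gap and the size of the reward gap, so the reward values themselves shape the target.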
Abstract
While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.
