Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi
TL;DR
This work tackles robustness in RLHF for large language model fine-tuning by addressing misspecification in reward and preference models. It proposes Variance-Reduced Preference Optimization (VRPO), which introduces an auxiliary preference model to reduce variance and achieves double robustness when paired with a known reference policy. Theoretical results show that VRPO lowers the variance and mean squared error (MSE) of the reward estimator and tightens the policy suboptimality gap. Empirically, VRPO outperforms baselines under misspecification on the Anthropic Helpful and Harmless (HH) dataset and on synthetic IMDb tasks, with 77–81% of its responses favored over baselines on HH. The approach applies to both one-stage and two-stage RLHF pipelines, and the code is publicly available.
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
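For readers unfamiliar with the Bradley-Terry model mentioned in the abstract, the following is a minimal sketch of its standard form, in which the probability that one response is preferred over another is the sigmoid of their reward difference. It illustrates only the baseline preference model, not the proposed VRPO algorithm; the function names and the scalar-reward inputs are assumptions made for illustration and do not come from the paper's code.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B.

    Hypothetical helper: rewards are assumed to be plain scalars here.
    """
    # Sigmoid of the reward difference.
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def bt_negative_log_likelihood(pairs):
    """Average negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    return -sum(math.log(bt_preference_prob(rc, rr)) for rc, rr in pairs) / len(pairs)
```

Under this model, equal rewards give a preference probability of 0.5, and the two orderings of a pair always sum to 1. If real preferences do not follow this form, a reward fitted by minimizing the loss above is misspecified, which is the setting the abstract describes.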
