Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

TL;DR

This work tackles robustness in RLHF for large language model fine-tuning by addressing misspecification in reward and preference models. It proposes Variance-Reduced Preference Optimization (VRPO), which introduces an auxiliary preference model to reduce variance and achieve double robustness when paired with a known reference policy. Theoretical results show that VRPO lowers the variance and MSE of the reward estimator and tightens the policy suboptimality gap. Empirically, VRPO outperforms baselines on the HH dataset and synthetic IMDb tasks, with 77–81% of its responses favored over baselines on HH. The approach applies to both one-stage and two-stage RLHF pipelines, and the code is publicly available.

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
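To make the Bradley-Terry assumption mentioned in the abstract concrete, here is a minimal sketch of the standard Bradley-Terry preference probability and the negative log-likelihood typically used to fit a reward model under it. This illustrates the baseline modeling assumption only, not the VRPO objective; the function names and the toy reward values are hypothetical and chosen for illustration.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B.

    P(A > B) = sigmoid(reward_a - reward_b), computed in a numerically
    stable way for large positive or negative reward gaps.
    """
    z = reward_a - reward_b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_negative_log_likelihood(pairs) -> float:
    """Average Bradley-Terry negative log-likelihood.

    `pairs` is a non-empty list of (reward_chosen, reward_rejected) tuples
    (hypothetical toy inputs). Uses -log sigmoid(z) = softplus(-z).
    """
    total = 0.0
    for reward_chosen, reward_rejected in pairs:
        z = reward_chosen - reward_rejected
        # Stable softplus(-z): log(1 + exp(-z)).
        if z >= 0:
            total += math.log1p(math.exp(-z))
        else:
            total += -z + math.log1p(math.exp(z))
    return total / len(pairs)


if __name__ == "__main__":
    # Toy rewards, assumed for illustration only.
    print(bt_preference_prob(1.5, 0.5))
    print(bt_negative_log_likelihood([(1.5, 0.5), (0.2, 0.9)]))
```

Under this model the preference probability depends only on the reward difference, which is the kind of structural assumption that real human judgments may violate.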

Paper Structure

This paper contains 23 sections, 3 theorems, 43 equations, 5 figures, 12 tables.

Key Result

Theorem 6.1

In the correctly specified setting, the target parameter satisfies $\bar{\theta}=\arg\min_{\theta}\mathbb{E} [\widetilde{\mathcal{L}}(\theta)]$ whenever either the reference policy $\pi_{\textrm{ref}}$ or the auxiliary preference model $p_{\eta}$ is correctly specified.

Figures (5)

  • Figure 1: Training of LLMs. (a) The upper panel visualizes autoregressive next-token prediction in pre-training, where each token (e.g., a word or punctuation mark) is encoded into a numerical integer, and its probability depends on all preceding tokens through the transformer architecture. (b) The bottom panel visualizes SFT and RLHF in post-training. Supervised fine-tuning (SFT) fine-tunes the model on a small dataset of high-quality human-written answers to align its outputs with these answers. RLHF employs RL based on human preference, specifying which of the two candidate answers is preferred.
  • Figure 2: VRPO incorporates an auxiliary preference model to reduce the variance of the estimated primary model. Left: The classic one-stage and two-stage optimization schemes in RLHF. Both approaches require fitting a reward model, either explicitly or implicitly, which may lead to model misspecification. Right: In contrast, VRPO employs an auxiliary reward-free preference model to better capture human preferences. It works jointly with the primary model for variance reduction and policy improvement.
  • Figure 3: Comparisons on the IMDb dataset. Left: Expected reward under different VRPO settings compared with DPO. For example, $(\pi_{ref}$ ✓, $P_{\eta}$ ✗$)$ means the reference model is correctly specified and the preference model is misspecified, and $P_{\eta}$ $\boldsymbol{\hat{=}}$ means the preference model is estimated. The results demonstrate the robustness of our method. Middle: The difference in preference probability distributions between the ground truth and the DPO estimate, for both the chosen and rejected responses. Right: Expected reward versus KL-divergence for VRPO with $\beta = 0.1$ and DPO with $\beta \in \{0.02, 0.05, 0.1\}$.
  • Figure 4: Head-to-head comparisons between VRPO, DPO, and SFT. Win rates are evaluated by GPT-4o-mini. Left: Win rates on the TL;DR dataset. Right: Win rates on the HH dataset. On both datasets, VRPO outperforms DPO, achieving win rates above 50% directly and higher win rates against SFT indirectly.
  • Figure 5: Win rates of responses over the chosen response in the HH dataset.

Theorems & Definitions (7)

  • Theorem 6.1: Double Robustness
  • Theorem 6.2: Variance and MSE reductions
  • Theorem 6.3: Reduction in suboptimality gap
  • proof
  • proof
  • proof
  • proof : Proof of Theorem 3'