The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen

TL;DR

The paper challenges the assumption that higher reward-model accuracy always improves RLHF-tuned language models: moderately accurate reward models can yield better performance on relevance, factuality, and completeness tasks. The authors run PPO-based RLHF on QA-FEEDBACK with Longformer-based reward models across T5-small, T5-base, and T5-large, and evaluate the results with independent high-accuracy reward models. They find a paradox in which overly accurate reward feedback can lead to overfitting and reduced generalization. They analyze training dynamics through KL divergence to explain why balanced, task-aligned rewards drive more robust learning. The findings suggest that selecting reward models within a moderate accuracy range and monitoring training stability are important for effective RLHF, and they point to new directions for reward-model design and evaluation beyond accuracy alone.

Abstract

Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether stronger reward models invariably lead to better language models. In this paper, through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional details are available at https://github.com/EIT-NLP/AccuracyParadox-RLHF.

Paper Structure

This paper contains 30 sections, 2 equations, 27 figures, 7 tables.

Figures (27)

  • Figure 1: 3D surface plot evaluating relevance ratios for T5-small. Optimal performance was achieved with reward models having moderate accuracy.
  • Figure 2: 3D surface plot evaluating factuality ratios for T5-small. The best performance was seen with reward models of moderate accuracy.
  • Figure 3: 3D surface plot evaluating completeness rewards for T5-small. Intermediate reward model strength yielded the best language model performance.
  • Figure 4: Reward analysis for relevance task (T5-small model): training steps vs. rewards (left), mean and variance of rewards (right).
  • Figure 5: Reward analysis for factuality task (T5-small model): training steps vs. rewards (left), mean and variance of rewards (right).
  • ...and 22 more figures