The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Yanjun Chen; Dawei Zhu; Yirong Sun; Xinghao Chen; Wei Zhang; Xiaoyu Shen

TL;DR

The paper challenges the assumption that higher reward-model accuracy always improves RLHF-tuned language models, showing that moderately accurate reward models can yield better performance on relevance, factuality, and completeness tasks. Using PPO-based RLHF on QA-FEEDBACK with Longformer-based reward models across T5-small, T5-base, and T5-large, and evaluating with independent high-accuracy reward models, it reports a paradox in which overly accurate reward feedback can lead to overfitting and reduced generalization. The authors analyze training dynamics through KL divergence to explain why balanced, task-aligned rewards drive more robust learning. These findings suggest that selecting reward models within a moderate accuracy range and monitoring training stability are important for effective RLHF, and they point to new directions for reward-model design and evaluation that go beyond accuracy alone.
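To make the KL-divergence monitoring mentioned above concrete, here is a minimal sketch of how a per-response KL estimate and a KL-penalized reward are commonly computed in PPO-based RLHF. It is not taken from the paper: the function names, the coefficient `beta`, and its default value are assumptions chosen for illustration, and the authors' exact formulation may differ.

```python
from typing import List


def sequence_kl_estimate(policy_logprobs: List[float],
                         ref_logprobs: List[float]) -> float:
    """Single-sample estimate of KL(policy || reference) for one response.

    Each list holds the log-probability that the respective model assigns to
    the tokens actually sampled from the policy. Summing the per-token
    differences gives a sequence-level estimate.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def kl_penalized_reward(rm_score: float, kl: float, beta: float = 0.1) -> float:
    """Reward-model score minus a KL penalty.

    `beta` is a hypothetical penalty coefficient, not a value from the paper.
    """
    return rm_score - beta * kl


# Toy numbers: the policy is slightly more confident than the reference.
kl = sequence_kl_estimate([-1.0, -2.0], [-1.5, -2.5])
shaped = kl_penalized_reward(rm_score=2.0, kl=kl, beta=0.5)
```

Tracking the KL estimate over training steps is one way to see how far the policy has drifted from its starting point under a given reward model.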

Abstract

Reinforcement Learning from Human Feedback (RLHF) significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of the reward models used during training. This study explores whether stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional details are available at https://github.com/EIT-NLP/AccuracyParadox-RLHF.
