CHARM: Calibrating Reward Models With Chatbot Arena Scores

Xiao Zhu, Chenmien Tan, Pinzhen Chen, Rico Sennrich, Yanlin Zhang, Hanxu Hu

TL;DR

This work identifies a systematic model preference bias in reward models (RMs) used for RLHF and proposes CHARM, a calibration method that uses Elo scores from the Chatbot Arena leaderboard to debias RM judgments. CHARM constructs a debiased preference dataset by applying a score offset $\Delta$ to the scores of over-valued models, and optimizes $\Delta$ so that RM win probabilities align with the Elo-based expectations $\mathbb{P}(O)$, with the misalignment quantified by the Mismatch Degree ($\text{MD}$). On RM-Bench and the Chat-Hard domain of RewardBench, CHARM consistently improves evaluation accuracy and strengthens alignment with human preferences, with larger gains for models exhibiting higher MD. The method generalizes to unseen models and offers a practical, efficient way to build fairer, more reliable reward models; MD also serves as a predictive indicator of how much a model will benefit from calibration.
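
The offset idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the standard logistic Elo expectation on a 400-point scale, defines the RM win rate as the fraction of prompts on which one model's response scores higher, and uses a simple bisection in place of whatever optimizer the authors use. The names `elo_expected_win`, `rm_win_rate`, `mismatch_gap`, and `fit_offset`, as well as the search range, are hypothetical, and `mismatch_gap` is only a stand-in for the paper's Mismatch Degree, whose exact definition is not given in this summary.

```python
from typing import Sequence


def elo_expected_win(elo_a: float, elo_b: float) -> float:
    """Standard logistic Elo expectation that model A beats model B."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def rm_win_rate(scores_a: Sequence[float], scores_b: Sequence[float],
                offset: float = 0.0) -> float:
    """Fraction of prompts where A's RM score, minus `offset`, beats B's.

    Ties count as half a win (an assumption made for this sketch).
    """
    wins = 0.0
    for a, b in zip(scores_a, scores_b):
        if a - offset > b:
            wins += 1.0
        elif a - offset == b:
            wins += 0.5
    return wins / len(scores_a)


def mismatch_gap(scores_a: Sequence[float], scores_b: Sequence[float],
                 elo_a: float, elo_b: float, offset: float = 0.0) -> float:
    """Illustrative gap between RM win rate and Elo expectation.

    A stand-in for the Mismatch Degree, not the paper's definition.
    """
    expected = elo_expected_win(elo_a, elo_b)
    return abs(rm_win_rate(scores_a, scores_b, offset) - expected)


def fit_offset(scores_a: Sequence[float], scores_b: Sequence[float],
               target: float, lo: float = 0.0, hi: float = 10.0,
               iters: int = 50) -> float:
    """Bisection for the offset at which A's win rate drops to `target`.

    Relies on the win rate being non-increasing in the offset, and assumes
    the rate exceeds `target` at `lo` and does not exceed it at `hi`.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if rm_win_rate(scores_a, scores_b, mid) > target:
            lo = mid  # A is still over-valued: subtract more
        else:
            hi = mid
    return (lo + hi) / 2.0
```

In use, `target` would be `elo_expected_win(elo_a, elo_b)` for an over-valued model A and a reference model B, and the returned value plays the role of $\Delta$. Because the empirical win rate is a step function of the offset, the bisection returns the threshold at which the rate first stops exceeding the target rather than an exact root.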

Abstract

Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback by serving as proxies for human preferences in aligning large language models. In this paper, we identify a model preference bias in RMs, where they systematically assign disproportionately high scores to responses from certain policy models. This bias distorts ranking evaluations and leads to unfair judgments. To address this issue, we propose a calibration method named CHatbot Arena calibrated Reward Modeling (CHARM) that leverages Elo scores from the Chatbot Arena leaderboard to mitigate RM overvaluation. We also introduce a Mismatch Degree metric to measure this preference bias. Our approach is computationally efficient, requiring only a small preference dataset for continued training of the RM. We conduct extensive experiments on reward model benchmarks and human preference alignment. Results demonstrate that our calibrated RMs (1) achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, and (2) exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings. By mitigating model preference bias, our method provides a generalizable and efficient solution for building fairer and more reliable reward models.
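
As a rough picture of how a small debiased preference dataset for continued training might be assembled, the sketch below relabels response pairs after subtracting a fixed offset from the over-valued model's RM scores. The abstract does not spell out the construction, so everything here is an assumption made for illustration: the `PreferencePair` record, the `build_debiased_pairs` function, the chosen/rejected labeling rule, and the decision to drop ties are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PreferencePair:
    """One training example: the preferred and the dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def build_debiased_pairs(prompts: Sequence[str],
                         responses_over: Sequence[str],
                         responses_ref: Sequence[str],
                         scores_over: Sequence[float],
                         scores_ref: Sequence[float],
                         offset: float) -> List[PreferencePair]:
    """Label pairs after subtracting `offset` from the over-valued model.

    `responses_over` / `scores_over` belong to the over-valued policy model,
    `responses_ref` / `scores_ref` to a reference model. Exact ties after
    the adjustment are skipped (an assumption of this sketch).
    """
    pairs: List[PreferencePair] = []
    rows = zip(prompts, responses_over, responses_ref, scores_over, scores_ref)
    for prompt, r_over, r_ref, s_over, s_ref in rows:
        adjusted = s_over - offset
        if adjusted == s_ref:
            continue  # no preference signal once the offset is applied
        if adjusted > s_ref:
            pairs.append(PreferencePair(prompt, r_over, r_ref))
        else:
            pairs.append(PreferencePair(prompt, r_ref, r_over))
    return pairs
```

The resulting pairs could then be fed to an ordinary pairwise preference objective for continued training of the RM; which objective the authors use is not stated in this summary.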

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Average reward model scores across policy models on AlpacaEval. The x-axis represents Arena Elo scores. The lower-left plot illustrates the Length-Controlled win rates of these models on AlpacaEval.
  • Figure 2: Win rates and Mismatch Degrees before and after calibration. In the win rate plots, the x-axis shows the expected win rates computed from the models' Elo scores, while the y-axis shows the win rates derived from the reward model scores. Points closer to the dotted line indicate better alignment between the reward model and human preferences.
  • Figure 3: Score results of Skywork-RM on more policy models in the AlpacaEval dataset.
  • Figure 4: Score results of Mistral-RM on more policy models in the AlpacaEval dataset.
  • Figure 5: Score results of FsfairX-RM on more policy models in the AlpacaEval dataset.
  • ...and 2 more figures