Establishing Reliability Metrics for Reward Models in Large Language Models

Yizhou Chen, Yawen Liu, Xuesi Wang, Qingtao Yu, Guangda Huzhang, Anxiang Zeng, Han Yu, Zhiming Zhou

TL;DR

This work tackles the challenge of measuring how reliably a reward model (RM) aligns LLM outputs with human preferences. It introduces RETA, a reliability metric defined as the normalized average oracle quality among the RM’s top $\eta$-fraction of responses: $\text{RETA}_Y(\eta)=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\mathbb{E}_{a}[J_q(a)\mid Y_q(a)\geq \Theta(\eta)]\,/\,\mathbb{E}_{a}[J_q(a)]$, where $Y_q(a)$ is the RM score of response $a$ to prompt $q$, $J_q(a)$ is its oracle score, and $\Theta(\eta)$ is the RM-score threshold of the top $\eta$ quantile. An integrated benchmarking pipeline uses $k$-DPP prompt sampling, a competent reference policy (e.g., Llama2-7B-Chat), and an oracle (GPT-4) to obtain calibrated scores, with asymptotically unbiased estimation and variance control via resampling. Empirical results show that RETA converges rapidly with sample size and provides more stable, interpretable assessments than best-of-n (BON). The normalizer mitigates prompt-bias effects, and RM reliability varies across models and quantile levels. The approach yields a cost-efficient, direct RM reliability benchmark that can guide RM development and RLHF training, enabling practitioners to identify reliable RMs and quantify reliability across different decision regimes.

Abstract

The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the Reliable at $\eta$ (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as assessed by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select the responses.
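
As a rough illustration of the metric described above, the following Python sketch computes an empirical RETA value from pre-collected scores. It is not the authors' implementation: the function name `reta`, the list-of-lists input layout, the per-prompt normalization by the mean oracle score, and the top-`ceil(eta * n)` selection rule (which breaks ties by sort order rather than by a threshold) are all assumptions made for this example. It also assumes oracle scores are positive so that the ratio is well defined.

```python
import math


def reta(rm_scores, oracle_scores, eta):
    """Empirical RETA for one RM over a set of prompts (illustrative sketch).

    rm_scores[p][r]     : RM score of response r to prompt p
    oracle_scores[p][r] : oracle score of the same response (assumed positive)
    eta                 : fraction of top-RM responses to keep, 0 < eta <= 1
    """
    if not 0 < eta <= 1:
        raise ValueError("eta must be in (0, 1]")
    ratios = []
    for y, j in zip(rm_scores, oracle_scores):
        # Number of responses in the RM's top eta-fraction (at least one).
        k = max(1, math.ceil(eta * len(y)))
        # Indices of the k responses the RM scores highest.
        top = sorted(range(len(y)), key=lambda i: y[i], reverse=True)[:k]
        top_mean = sum(j[i] for i in top) / k
        # Assumed normalizer: mean oracle score over all responses to this prompt.
        base_mean = sum(j) / len(j)
        ratios.append(top_mean / base_mean)
    # Average the normalized quality over prompts.
    return sum(ratios) / len(ratios)


# Hypothetical scores for a single prompt with four responses.
rm = [[0.9, 0.1, 0.5, 0.3]]
oracle = [[8, 2, 6, 4]]
print(reta(rm, oracle, eta=0.25))
print(reta(rm, oracle, eta=1.0))
```

Under this sketch, a value above 1 means the responses the RM prefers are better than average according to the oracle, and at $\eta = 1$ every response is kept, so the value is exactly 1.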

Paper Structure

This paper contains 22 sections, 2 theorems, 10 equations, 9 figures, 4 tables.

Key Result

Theorem 1

Let $F$ denote the cumulative distribution function of the random variable $X\equiv Y_{q}(a)$, $a\sim \theta_{q}(a)$, and let $\Theta(\eta)=\inf\{x : F(x)\geq 1-\eta\}$ define the quantile function. Assume $J_{q}$ is bounded. Then, the following holds at all continuity points $\eta$ of $\Theta$.
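
To make the quantile function in the theorem's setup concrete, the sketch below evaluates $\Theta(\eta)=\inf\{x : F(x)\geq 1-\eta\}$ with $F$ replaced by the empirical CDF of a finite sample of RM scores. The function name `quantile_threshold` and the use of the empirical CDF are assumptions for illustration only; the theorem itself concerns the true distribution.

```python
import math


def quantile_threshold(scores, eta):
    """Empirical Theta(eta) = inf{x : F(x) >= 1 - eta} (illustrative sketch).

    F is the empirical CDF of `scores`: F(x) = (# of scores <= x) / n.
    """
    if not 0 < eta <= 1:
        raise ValueError("eta must be in (0, 1]")
    xs = sorted(scores)
    n = len(xs)
    # Smallest k with k / n >= 1 - eta; rounding guards against
    # floating-point noise such as (1 - 0.7) * 10 == 3.0000000000000004.
    k = math.ceil(round((1 - eta) * n, 9))
    if k == 0:
        # F(x) >= 0 holds for every x, so the infimum is unbounded below.
        return float("-inf")
    # The k-th smallest score is the first point where F reaches 1 - eta.
    return xs[k - 1]


# Hypothetical RM scores for four responses to one prompt.
scores = [0.9, 0.1, 0.5, 0.3]
print(quantile_threshold(scores, eta=0.25))
print(quantile_threshold(scores, eta=0.5))
```

Note that with a finite sample, keeping responses with score at least $\Theta(\eta)$ can retain somewhat more than an $\eta$ fraction, because the response sitting exactly at the threshold is included.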

Figures (9)

  • Figure 1: The benchmark building pipeline and the computation of the RETA metric.
  • Figure 2: Results on the Reliability-on-Helpfulness benchmark: (a) The estimate of RETA($\eta=1/4$), computed with the paper's approximation equation, versus the resample size $n$. Dashed horizontal lines mark the limiting values, and the light grey shaded area marks the range used to compute the final estimate of RETA's limiting values (see the paper's section on estimation). (b) Best-of-n (BON) curve versus $n$. The x-axis can also be transformed into KL divergence without loss of generality. (c) 2nd-best-of-n curve versus $n$. (d) The average oracle score of the best-32-of-n curve versus $n$. In all panels, the standard error across prompts is plotted.
  • Figure 3: The RETA curves on Reliability-on-Helpfulness benchmark. The x-axis is plotted on a logarithmic scale ($-\log_2\eta$) for better visualization. Each curve is composed of 15 points connected by interpolation. The standard error across prompts is plotted. Note that as $\eta$ decreases, the evaluation tends to become noisier.
  • Figure 4: The fitted curve of RETA metric versus prompt perplexity on Reliability-on-Helpfulness benchmark, (a) with normalizer, (b) without normalizer. The fitting is performed using Gaussian Processes, and curves in both figures utilize the same RBF kernel with a length scale of 0.5. To ensure a fair visual comparison, both figures plot the y-range as $\pm10\%$ of the mean value (of all curves).
  • Figure 5: The Hit Rate metric evaluated on the Reliability-on-Helpfulness dataset: (a) The hit rate at $n$ is calculated by comparing the top $n$ responses ranked by the RM with the ground-truth set, which is defined as the $\eta=1/4$ quantile of responses selected by the oracle. (b) Panel (a) replotted after subtracting the result of Pythia-1.4B as the baseline. (c) A zoomed-in view of (b) that provides a closer look at the head region.
  • ...and 4 more figures
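
The hit rate described in the Figure 5 caption can be sketched as follows. The caption does not give an exact formula, so this example assumes the hit rate at $n$ is the fraction of the RM's top $n$ responses that fall inside the oracle's top $\eta$-fraction; the function name `hit_rate_at_n` and the flat per-prompt score lists are likewise hypothetical.

```python
import math


def hit_rate_at_n(rm_scores, oracle_scores, n, eta=0.25):
    """Fraction of the RM's top-n responses that lie in the oracle's
    top eta-fraction (assumed definition, for illustration only)."""
    size = len(rm_scores)
    if not 1 <= n <= size:
        raise ValueError("n must be between 1 and the number of responses")
    # Response indices ranked by each scorer, best first.
    by_rm = sorted(range(size), key=lambda i: rm_scores[i], reverse=True)
    by_oracle = sorted(range(size), key=lambda i: oracle_scores[i], reverse=True)
    # Ground-truth set: the oracle's top eta-fraction of responses.
    truth = set(by_oracle[:max(1, math.ceil(eta * size))])
    hits = sum(1 for i in by_rm[:n] if i in truth)
    return hits / n


# Hypothetical scores for eight responses to one prompt.
rm = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4]
oracle = [8, 2, 6, 4, 3, 1, 9, 5]
for n in (1, 2, 3):
    print(n, hit_rate_at_n(rm, oracle, n))
```

In a benchmark setting this per-prompt value would presumably be averaged over prompts, which the sketch leaves to the caller.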

Theorems & Definitions (3)

  • Theorem 1
  • Lemma 1
  • proof