Establishing Reliability Metrics for Reward Models in Large Language Models

Yizhou Chen; Yawen Liu; Xuesi Wang; Qingtao Yu; Guangda Huzhang; Anxiang Zeng; Han Yu; Zhiming Zhou

TL;DR

This work tackles the challenge of measuring reward model (RM) reliability when aligning LLM outputs with human preferences. It introduces RETA, a reliability metric defined as the average oracle quality among the RM's top $\eta$-fraction of responses: $\text{RETA}_Y(\eta)=\frac{1}{|\mathcal{Q}|}\,\mathbb{E}_{q}[\mathbb{E}_{a}[J_q(a)\;|\;Y_q(a)\geq \Theta(\eta)]]\,/\;\mathbb{E}_{a}[J_q(a)]$, where $J_q(a)$ is the oracle score of response $a$ to prompt $q$, $Y_q(a)$ is the RM score, and $\Theta(\eta)$ is the RM-score threshold of the top $\eta$ quantile. An integrated benchmarking pipeline uses $k$-DPP prompt sampling, a competent reference policy (e.g., Llama2-7B-Chat), and an oracle (GPT-4) to obtain calibrated scores, with asymptotically unbiased estimation and variance control via resampling. Empirical results show that RETA converges rapidly with sample size and provides more stable, interpretable assessments than best-of-$n$ (BON) evaluation, that the normalizer mitigates prompt-bias effects, and that RM reliability varies across models and quantile levels. The approach yields a cost-efficient, direct RM reliability benchmark that can guide RM development and RLHF training, enabling practitioners to identify reliable RMs and quantify reliability across different decision regimes.

Abstract

The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the Reliable at $\eta$ (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select the responses.
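To make the metric concrete, the following is a minimal sketch of an empirical RETA estimate and of the quantile sweep mentioned at the end of the abstract. It is not the authors' implementation. It assumes that the top-$\eta$ threshold is applied per prompt, that each prompt's top-quantile oracle mean is divided by that prompt's overall oracle mean before averaging over prompts, and that oracle scores are positive. The function name `reta`, the toy scores, and the candidate $\eta$ grid are all hypothetical.

```python
import math
from statistics import mean


def reta(rm_scores, oracle_scores, eta):
    """Empirical RETA at quantile level eta (illustrative sketch).

    rm_scores[q][i]     : RM score of the i-th sampled response to prompt q
    oracle_scores[q][i] : oracle score of the same response (assumed > 0)
    """
    if not 0 < eta <= 1:
        raise ValueError("eta must lie in (0, 1]")
    ratios = []
    for y, j in zip(rm_scores, oracle_scores):
        # Number of responses in the RM's top eta-fraction (at least one).
        # The small epsilon guards against floating-point overshoot in eta * n.
        k = max(1, math.ceil(eta * len(y) - 1e-9))
        # Indices of the k responses the RM ranks highest (ties keep input order).
        top = sorted(range(len(y)), key=lambda i: y[i], reverse=True)[:k]
        top_quality = mean(j[i] for i in top)
        # Assumed normalizer: the prompt's mean oracle score over all responses.
        ratios.append(top_quality / mean(j))
    return mean(ratios)


# Hypothetical toy data: 2 prompts, 4 sampled responses each.
toy_rm = [[0.9, 0.1, 0.5, 0.3], [0.2, 0.8, 0.6, 0.4]]
toy_oracle = [[8, 2, 6, 4], [3, 9, 5, 7]]

# Sweep a hypothetical grid of quantile levels and keep the best one.
candidate_etas = [0.25, 0.5, 0.75, 1.0]
best_eta = max(candidate_etas, key=lambda e: reta(toy_rm, toy_oracle, e))
```

Under these assumptions, a value above 1 means that the responses the RM ranks highest are better, according to the oracle, than an average response from the reference policy, and $\eta = 1$ always gives exactly 1. For an unreliable RM the sweep may favor a larger $\eta$ than the smallest one available.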
