Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Jianxiang Zang, Yongda Wei, Ruxue Bai, Shiyu Jiang, Nijia Mo, Binhong Li, Qiang Sun, Hui Liu

TL;DR

This work shifts RM evaluation from static preference accuracy to conditional reliability under real-world perturbations by introducing Suitability and Reward Auditor. It frames suitability inference as a non-parametric, paired statistical test across a diverse perturbation suite, controlling false discoveries with a group-aware FDR procedure. Case studies on RM Bench and Reward Bench reveal widespread vulnerability patterns, especially to stylized and semantic perturbations, and show that RM suitability strongly predicts downstream alignment performance under perturbations. The framework provides a principled, verifiable approach to diagnosing and mitigating latent RM vulnerabilities for safer, more robust LLM alignment.

Abstract

Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracy in specific, given scenarios, obscuring critical vulnerabilities of RMs in real-world scenarios. We identify that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer that RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing the distributional degradation of RM preference perception confidence. This enables inference of both the certainty and the severity of RM vulnerabilities across diverse real-world scenarios, and lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.

Paper Structure

This paper contains 46 sections, 4 theorems, 60 equations, 14 figures, 17 tables, 2 algorithms.

Key Result

Lemma 3.1

Let $\mathcal{H}_0$ be the null hypothesis, under which the true p-value, $p_{\infty}$, is a random variable uniformly distributed on $[0, 1]$. Let $B$ be the number of permutations used to estimate the p-value. Let the random variable $C$ count the number of permuted test statistics that are greater [...] This implies that the test is not exact, as the Type I error rate is generally not equal to $\alpha$.
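
The lemma concerns p-values estimated from $B$ random permutations and a count $C$ of permuted statistics. A minimal sketch of that setup follows, assuming a one-sided sign-flip test on paired differences with the mean as the test statistic; the paper's actual statistic is not stated in this listing, and `paired_permutation_pvalue` and `attainable_type1_rate` are hypothetical helper names. The p-value uses the count-based estimate $(C+1)/(B+1)$ associated with phipson2010permutation.

```python
import math
import numpy as np

def paired_permutation_pvalue(orig, pert, B=9999, seed=0):
    """One-sided sign-flip permutation test for paired samples.

    Assumption: under H0 the sign of each paired difference is exchangeable,
    and the test statistic is the mean difference (orig - pert).
    Returns the count-based p-value (C + 1) / (B + 1) and the count C.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(orig, dtype=float) - np.asarray(pert, dtype=float)
    observed = d.mean()
    # Draw B random sign assignments, one row per permutation.
    signs = rng.choice([-1.0, 1.0], size=(B, d.size))
    permuted = (signs * d).mean(axis=1)
    # C counts permuted statistics at least as large as the observed one.
    C = int(np.sum(permuted >= observed))
    return (C + 1) / (B + 1), C

def attainable_type1_rate(alpha, B):
    """P(p <= alpha) under H0, assuming C is uniform on {0, ..., B} (no ties)."""
    # Small tolerance guards against floating-point error just below an integer.
    return math.floor(alpha * (B + 1) + 1e-9) / (B + 1)
```

With $B = 50$ and $\alpha = 0.05$, for instance, `attainable_type1_rate` gives $2/51 \approx 0.039$ rather than $0.05$. Under the stated no-ties assumption, this illustrates how the achieved Type I error rate can differ from the nominal level.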

Figures (14)

  • Figure 1: Perspectives on suitability of an RM. When the preference perception confidence distribution $\mathbb{P}_{\theta}$ of model $\mathcal{R}_{\theta}$ on dataset $D$ is stochastically less than the distribution on the perturbed dataset $\mathcal{P}(D)$ by more than a preset margin $m$, the RM is considered suitable under the perturbation. The figure style follows pouget2025suitability.
  • Figure 2: Marginal distribution metrics for suitability auditing of RMs on the 5 RM Bench subsets. The radar chart and bar chart present the marginal distribution metrics from the perturbation perspective and the RM perspective, respectively.
  • Figure 3: Correlation between the suitability risk of the RMs and the corresponding performance of the perturbed policy models. We report the Spearman's rank correlation coefficient of the linear correlation fit.
  • Figure 4: Spearman correlation analysis of paired permutation test p-values across different test statistics. We report the Spearman's rank correlation coefficient $\rho$ and the p-value of the linear correlation fit.
  • Figure 5: Spearman correlation analysis of p-values from the Wilcoxon signed-rank test and the permutation test on skewed samples in RM Bench. We report the Spearman's rank correlation coefficient $\rho$ and the p-value of the linear correlation fit.
  • ...and 9 more figures

Theorems & Definitions (13)

  • Definition 3.1: Suitability of Reward Modeling
  • Definition 3.2: Suitability Inference via Hypothesis Testing
  • Definition 4.1: Paired-sample Testing Metrics
  • Definition 4.3: Count-based Permutation p-value phipson2010permutation
  • Definition 4.4: Group-aware Benjamini-Hochberg Procedure
  • Lemma 3.1: phipson2010permutation
  • proof
  • Theorem 3.2: Exact Permutation p-value under a Uniform Prior
  • proof
  • Lemma 3.3: li2019multiple
  • ...and 3 more
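
Definition 4.4 names a group-aware Benjamini-Hochberg procedure, but its definition is not reproduced in this listing. As a point of reference only, the standard BH step-up rule (without any group weighting) can be sketched as below. The function name `benjamini_hochberg` and the target level `q` are illustrative choices, not the paper's notation.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Standard Benjamini-Hochberg step-up procedure (no group weighting).

    Returns a boolean array marking which hypotheses are rejected
    while controlling the false discovery rate at level q.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = q * np.arange(1, m + 1) / m   # k * q / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: reject every hypothesis up to the largest k that passes.
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject
```

A group-aware variant would typically adjust the p-values or thresholds per group before applying this rule; how the paper does so is left to its Definition 4.4.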