Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Jianxiang Zang; Yongda Wei; Ruxue Bai; Shiyu Jiang; Nijia Mo; Binhong Li; Qiang Sun; Hui Liu

TL;DR

This work shifts reward model (RM) evaluation from static preference accuracy to conditional reliability under real-world perturbations by introducing the notion of Suitability and a framework, Reward Auditor, for inferring it. Suitability inference is framed as a non-parametric, paired statistical test across a diverse perturbation suite, with false discoveries controlled by a group-aware false discovery rate (FDR) procedure. Case studies on RM Bench and Reward Bench reveal widespread vulnerability patterns, especially to stylized and semantic perturbations, and show that RM suitability strongly predicts downstream alignment performance under perturbations. The framework provides a principled, verifiable approach to diagnosing and mitigating latent RM vulnerabilities for safer, more robust LLM alignment.

Abstract

Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracy in fixed, given scenarios, which obscures critical vulnerabilities of RMs in real-world scenarios. We argue that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework designed specifically for RM suitability inference. Rather than asking "How accurate is the RM's preference perception for given samples?", it uses scientific auditing to ask "Can we infer that RMs exhibit systematic vulnerabilities in specific real-world scenarios?" Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing the degradation of the distribution of RM preference perception confidence. This enables inference of both the certainty and the severity of RM vulnerabilities across diverse real-world scenarios, laying a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
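
To make the auditing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sample has a scalar RM confidence score before and after a perturbation, uses a one-sided sign-flip permutation test as one possible non-parametric paired test, reports a standardized mean drop as the effect size, and applies the plain Benjamini-Hochberg procedure as a simplified stand-in for a group-aware FDR procedure. All function names, scenario names, and the synthetic data are hypothetical.

```python
import random
from statistics import mean, stdev


def paired_sign_flip_test(clean, perturbed, n_resamples=5000, seed=0):
    """One-sided paired permutation test for a drop in confidence.

    Assumption: `clean` and `perturbed` hold per-sample confidence scores
    for the same samples, in the same order.
    Returns (p_value, effect_size); effect_size is the mean paired drop
    divided by the standard deviation of the paired drops.
    """
    diffs = [c - p for c, p in zip(clean, perturbed)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely
        # to carry either sign.
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_resamples + 1)
    spread = stdev(diffs)
    effect_size = observed / spread if spread > 0 else float("nan")
    return p_value, effect_size


def benjamini_hochberg(p_values, alpha=0.05):
    """Plain Benjamini-Hochberg step-up procedure (not group-aware).

    Returns a list of booleans, True where the null is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Synthetic confidences; the scenario names are illustrative only.
    rng = random.Random(1)
    clean = [rng.uniform(0.6, 0.9) for _ in range(200)]
    scenarios = {
        "simulated_degradation": [c - rng.gauss(0.08, 0.05) for c in clean],
        "simulated_null": [c + rng.gauss(0.0, 0.05) for c in clean],
    }
    results = {
        name: paired_sign_flip_test(clean, scores)
        for name, scores in scenarios.items()
    }
    flags = benjamini_hochberg([p for p, _ in results.values()])
    for (name, (p, effect)), flagged in zip(results.items(), flags):
        print(f"{name}: p={p:.4f}, effect={effect:.2f}, flagged={flagged}")
```

In this sketch the p-value plays the role of the certainty of a vulnerability and the effect size the role of its severity, and the multiple-testing correction is applied across scenarios so that auditing many perturbations does not inflate the number of spurious flags.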
