Reward Under Attack: Analyzing the Robustness and Hackability of Process Reward Models

Rishabh Tiwari; Aditya Tomar; Udbhav Bamba; Monishwaran Maheswaran; Heng Yang; Michael W. Mahoney; Kurt Keutzer; Amir Gholami

TL;DR

State-of-the-art PRMs are systematically exploitable under adversarial optimization pressure: they function as fluency detectors rather than reasoning verifiers, creating systematic blind spots that undermine their use as training signals.

Abstract

Process Reward Models (PRMs) are rapidly becoming the backbone of LLM reasoning pipelines, yet we demonstrate that state-of-the-art PRMs are systematically exploitable under adversarial optimization pressure. To quantify these vulnerabilities, we introduce a three-tiered diagnostic framework that applies increasing adversarial pressure. Static perturbation analysis uncovers a fluency-logic dissociation: PRMs show high invariance to surface-level style changes (reward changes $< 0.1$), yet detect logically corrupted reasoning inconsistently, with different models failing on different attack types. Adversarial optimization demonstrates that gradient-based attacks inflate rewards on invalid trajectories, with reward landscapes exhibiting wide, exploitable peaks. RL-induced reward hacking exposes the critical failure mode: policies trained on AIME problems achieve near-perfect PRM rewards ($> 0.9$) while ground-truth accuracy remains low (below 4%), with 43% of reward gains attributable to stylistic shortcuts. These findings reveal that current PRMs function as fluency detectors rather than reasoning verifiers, creating systematic blind spots that undermine their use as training signals. We release PRM-BiasBench and a diagnostic toolkit to enable robustness evaluation before deployment. The code and dataset are available at https://github.com/SqueezeAILab/reward-under-attack.

Paper Structure (53 sections, 2 equations, 13 figures, 4 tables)

Introduction
Related Work
Reward Model Vulnerabilities
Process Reward Models
Adversarial Attacks on Neural Networks
Reward Overoptimization
Preliminaries
Trajectory Level Reward Calculation.
Robustness Criteria.
Experimental Setup.
Static Perturbation Analysis
Perturbation Taxonomy
Results
Style Invariance.
Asymmetric Logic Detection.
...and 38 more sections

Figures (13)

Figure 1: Overview of static perturbation analysis. A prompt-response pair (Step 1) undergoes bias injection (Step 2), such as question shuffling, where the question is changed but the response is left unmodified (Step 3). The perturbed pair is fed to the PRM (Step 4), and its scores are compared against the original to quantify sensitivity (Step 5).
Figure 2: Distribution of $\Delta R$ under semantics-preserving perturbations. Both PRMs exhibit tight distributions centered near zero, indicating strong invariance to surface-level stylistic changes.
Figure 3: Distribution of $\Delta R$ under semantics-altering perturbations. (a) Question shuffling: Skywork penalizes mismatched questions with a smaller reward (peak at $\Delta R \approx -0.8$), while Qwen's rewards remain high and essentially unchanged. (b) Reasoning hallucination: Qwen exhibits bimodal behavior, with strong penalization at $\Delta R = -1$ but also substantial, undesirable mass near zero. An ideal PRM would produce very low rewards (strongly negative $\Delta R$) in both scenarios.
Figure 4: Reward landscape for a single continuous token ($k=1$) on Skywork-1.5B. A single optimized embedding vector rapidly increases mean batch reward, demonstrating that minimal adversarial capacity suffices to exploit PRM vulnerabilities.
Figure 5: Training dynamics for 100 discrete tokens on Skywork-1.5B across 8 AIME24 trajectories. Reward (blue) increases from 0.11 to 0.95 as entropy (orange) decreases, indicating successful discretization of adversarial tokens.
...and 8 more figures
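
The static perturbation pipeline described in Figure 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released diagnostic toolkit: `toy_scorer` is a hypothetical stand-in for a PRM, the sign convention $\Delta R = R_{\text{perturbed}} - R_{\text{original}}$ is inferred from the figure captions, and the 0.1 tolerance is taken from the style-invariance figure quoted in the abstract.

```python
from typing import Callable, List, Sequence, Tuple

# A (question, response) pair, as in Step 1 of Figure 1.
Pair = Tuple[str, str]
# Hypothetical scorer interface: maps a pair to a reward in [0, 1].
Scorer = Callable[[str, str], float]


def toy_scorer(question: str, response: str) -> float:
    """Hypothetical stand-in for a PRM: the fraction of response tokens
    that also appear in the question. A real PRM would be a neural model."""
    question_tokens = set(question.lower().split())
    response_tokens = response.lower().split()
    if not response_tokens:
        return 0.0
    hits = sum(token in question_tokens for token in response_tokens)
    return hits / len(response_tokens)


def reward_deltas(
    scorer: Scorer, originals: Sequence[Pair], perturbed: Sequence[Pair]
) -> List[float]:
    """Assumed definition: delta R = R(perturbed) - R(original), per pair."""
    if len(originals) != len(perturbed):
        raise ValueError("originals and perturbed must be paired one-to-one")
    return [scorer(*p) - scorer(*o) for o, p in zip(originals, perturbed)]


def is_invariant(deltas: Sequence[float], tol: float = 0.1) -> bool:
    """True when every reward change stays within the tolerance (assumed 0.1)."""
    return all(abs(delta) < tol for delta in deltas)


original = ("compute 2 plus 3", "2 plus 3 equals 5")
# Question shuffling: swap in an unrelated question, keep the response.
shuffled = ("find the area of a circle", "2 plus 3 equals 5")
# Semantics-preserving change: only the whitespace differs.
restyled = ("compute 2 plus 3", "2  plus 3  equals 5")

shuffle_deltas = reward_deltas(toy_scorer, [original], [shuffled])
style_deltas = reward_deltas(toy_scorer, [original], [restyled])
```

Under this convention, a robust scorer should show `is_invariant(style_deltas)` as true while `shuffle_deltas` is strongly negative; a scorer whose reward barely moves under question shuffling would exhibit the failure mode the captions describe.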
