Table of Contents
Fetching ...

One Token to Fool LLM-as-a-Judge

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu

TL;DR

The paper reveals that standard LLM-based reward models used as verifiers in RLVR are highly vulnerable to 'master key' inputs, such as simple non-word symbols or generic reasoning headers, which can trigger false positives. To counter this, the authors introduce Master Reward Models (Master-RMs) trained with a targeted data augmentation strategy that includes adversarial-like negative examples, achieving near-zero false positives across diverse benchmarks and model families. They validate robustness through extensive experiments, showing strong agreement with GPT-4o and human judgments and competitive performance on verifiable benchmarks like VerifyBench. The work underscores the need for resilient, trustworthy LLM evaluators and provides publicly available Master-RMs and synthetic data to catalyze further research in robust evaluation of AI systems.

Abstract

Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term ''master keys'' such as non-word symbols (e.g., '':'' or ''.'') or generic reasoning openers (e.g., ''Thought process:'' or ''Let's solve this problem step by step.''), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these ''master key'' attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

One Token to Fool LLM-as-a-Judge

TL;DR

The paper reveals that standard LLM-based reward models used as verifiers in RLVR are highly vulnerable to 'master key' inputs, such as simple non-word symbols or generic reasoning headers, which can trigger false positives. To counter this, the authors introduce Master Reward Models (Master-RMs) trained with a targeted data augmentation strategy that includes adversarial-like negative examples, achieving near-zero false positives across diverse benchmarks and model families. They validate robustness through extensive experiments, showing strong agreement with GPT-4o and human judgments and competitive performance on verifiable benchmarks like VerifyBench. The work underscores the need for resilient, trustworthy LLM evaluators and provides publicly available Master-RMs and synthetic data to catalyze further research in robust evaluation of AI systems.

Abstract

Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term ''master keys'' such as non-word symbols (e.g., '':'' or ''.'') or generic reasoning openers (e.g., ''Thought process:'' or ''Let's solve this problem step by step.''), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these ''master key'' attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

Paper Structure

This paper contains 40 sections, 2 equations, 9 figures, 19 tables.

Figures (9)

  • Figure 1: Systematic vulnerabilities of LLM judges exposed by "master key" attacks across diverse datasets. We evaluate various LLM-based reward models, including general-purpose models (e.g., Qwen2.5-72B, GPT-4o) and dedicated verifiers (e.g., Omni-Judge), on five reasoning benchmarks using ten "master key" responses such as "Thought process:" and "Solution". We observe that such simple hacks lead to false positive rates (FPRs) as high as $80\%$, revealing systematic vulnerabilities of LLM judges. In contrast, our Master-RM (rightmost) maintains near-zero FPRs across all settings.
  • Figure 2: In a "collapsed" RLVR training, the response length drops sharply to fewer than 30 tokens while the KL divergence surges, a dynamic that differs significantly from a non-collapsed run.
  • Figure 3: Reasoning openers such as "Solution" can trigger false positive rewards in many state-of-the-art LLMs when used as generative reward models. See Table \ref{['tab:appendix:examples']} for more examples.
  • Figure 4: False positive rate (FPR) versus scaling of Qwen models. We evaluate the FPRs of the Qwen2.5-Instruct model series (with sizes 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B) and analyze how FPR varies with model size. In all figures above, X-axis is model size (B) and y-axis is FPR averaged over all the ten "master keys" listed in Table \ref{['tab:merged-five']}.
  • Figure 5: Multi-subject RLVR Benchmark
  • ...and 4 more figures