One Token to Fool LLM-as-a-Judge
Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu
TL;DR
The paper reveals that standard LLM-based reward models used as verifiers in RLVR are highly vulnerable to 'master key' inputs, such as simple non-word symbols or generic reasoning headers, which can trigger false positives. To counter this, the authors introduce Master Reward Models (Master-RMs) trained with a targeted data augmentation strategy that includes adversarial-like negative examples, achieving near-zero false positives across diverse benchmarks and model families. They validate robustness through extensive experiments, showing strong agreement with GPT-4o and human judgments and competitive performance on verifiable benchmarks like VerifyBench. The work underscores the need for resilient, trustworthy LLM evaluators and provides publicly available Master-RMs and synthetic data to catalyze further research in robust evaluation of AI systems.
Abstract
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term ''master keys'' such as non-word symbols (e.g., '':'' or ''.'') or generic reasoning openers (e.g., ''Thought process:'' or ''Let's solve this problem step by step.''), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these ''master key'' attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
