LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, Shouling Ji
TL;DR
This paper introduces RobustJudge, a fully automated framework for evaluating how robust LLM-based evaluators (LLM-as-a-Judge) are to adversarial manipulation. It systematically assesses 15 attack methods and 7 defenses across 12 judge models, analyzes the impact of prompt templates and model choice, and includes a real-world case study on Alibaba's PAI platform. Key findings: judges are widely vulnerable to composite and optimization-based attacks, robustness depends significantly on prompt design, task-focused fine-tuning (e.g., JudgeLM-13B) brings notable robustness gains, and the real-world deployment exposes hidden weaknesses. The work provides practical guidance for building trustworthy LLM-as-a-Judge systems and highlights the need for stronger defenses in security-sensitive evaluation tasks.
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks, driving the development and widespread adoption of LLM-as-a-Judge systems for automated evaluation, including red teaming and benchmarking. However, these systems are susceptible to adversarial attacks that can manipulate evaluation outcomes, raising critical concerns about their robustness and trustworthiness. Existing evaluation methods for LLM-based judges are often fragmented and lack a unified framework for comprehensive robustness assessment. Furthermore, the impact of prompt template design and model selection on judge robustness has rarely been explored, and the performance of these judges in real-world deployments remains largely unverified. To address these gaps, we introduce RobustJudge, a fully automated and scalable framework designed to systematically evaluate the robustness of LLM-as-a-Judge systems. Specifically, RobustJudge investigates the effectiveness of 15 attack methods and 7 defense strategies across 12 models (RQ1), examines the impact of prompt template design and model selection (RQ2), and evaluates the security of real-world deployments (RQ3). Our study yields three key findings: (1) LLM-as-a-Judge systems are highly vulnerable to attacks such as PAIR and combined attacks, while defense mechanisms such as re-tokenization and LLM-based detectors can provide enhanced protection; (2) robustness varies substantially across prompt templates (up to 40%); (3) deploying RobustJudge on Alibaba's PAI platform uncovers previously unknown vulnerabilities. These results offer practical insights for building trustworthy LLM-as-a-Judge systems.
