Table of Contents
Fetching ...

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, Shouling Ji

TL;DR

This paper introduces RobustJudge, a fully automated framework for evaluating the robustness of LLMs used as evaluators (LLM-as-a-Judge) under adversarial manipulation. It systematically assesses 15 attack methods and 7 defenses across 12 judge models, and analyzes the impact of prompt templates and model choice, supplemented by a real-world case study on Alibaba's PAI platform. Key findings show widespread vulnerability to composite and optimization-based attacks, significant dependence on prompt design, and notable robustness gains from task-focused fine-tuning (e.g., JudgeLM-13B), with real-world deployments revealing hidden weaknesses. The work provides practical guidance for building trustworthy LLM-as-a-Judge systems and highlights the need for stronger defenses in security-sensitive evaluation tasks, enabling more reliable automated judging in diverse domains.

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks, driving the development and widespread adoption of LLM-as-a-Judge systems for automated evaluation, including red teaming and benchmarking. However, these systems are susceptible to adversarial attacks that can manipulate evaluation outcomes, raising critical concerns about their robustness and trustworthiness. Existing evaluation methods for LLM-based judges are often fragmented and lack a unified framework for comprehensive robustness assessment. Furthermore, the impact of prompt template design and model selection on judge robustness has rarely been explored, and their performance in real-world deployments remains largely unverified. To address these gaps, we introduce RobustJudge, a fully automated and scalable framework designed to systematically evaluate the robustness of LLM-as-a-Judge systems. Specifically, RobustJudge investigates the effectiveness of 15 attack methods and 7 defense strategies across 12 models (RQ1), examines the impact of prompt template design and model selection (RQ2), and evaluates the security of real-world deployments (RQ3). Our study yields three key findings: (1) LLM-as-a-Judge systems are highly vulnerable to attacks such as PAIR and combined attacks, while defense mechanisms such as re-tokenization and LLM-based detectors can provide enhanced protection; (2) robustness varies substantially across prompt templates (up to 40%); (3) deploying RobustJudge on Alibaba's PAI platform uncovers previously undiscovered vulnerabilities. These results offer practical insights for building trustworthy LLM-as-a-Judge systems.

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

TL;DR

This paper introduces RobustJudge, a fully automated framework for evaluating the robustness of LLMs used as evaluators (LLM-as-a-Judge) under adversarial manipulation. It systematically assesses 15 attack methods and 7 defenses across 12 judge models, and analyzes the impact of prompt templates and model choice, supplemented by a real-world case study on Alibaba's PAI platform. Key findings show widespread vulnerability to composite and optimization-based attacks, significant dependence on prompt design, and notable robustness gains from task-focused fine-tuning (e.g., JudgeLM-13B), with real-world deployments revealing hidden weaknesses. The work provides practical guidance for building trustworthy LLM-as-a-Judge systems and highlights the need for stronger defenses in security-sensitive evaluation tasks, enabling more reliable automated judging in diverse domains.

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks, driving the development and widespread adoption of LLM-as-a-Judge systems for automated evaluation, including red teaming and benchmarking. However, these systems are susceptible to adversarial attacks that can manipulate evaluation outcomes, raising critical concerns about their robustness and trustworthiness. Existing evaluation methods for LLM-based judges are often fragmented and lack a unified framework for comprehensive robustness assessment. Furthermore, the impact of prompt template design and model selection on judge robustness has rarely been explored, and their performance in real-world deployments remains largely unverified. To address these gaps, we introduce RobustJudge, a fully automated and scalable framework designed to systematically evaluate the robustness of LLM-as-a-Judge systems. Specifically, RobustJudge investigates the effectiveness of 15 attack methods and 7 defense strategies across 12 models (RQ1), examines the impact of prompt template design and model selection (RQ2), and evaluates the security of real-world deployments (RQ3). Our study yields three key findings: (1) LLM-as-a-Judge systems are highly vulnerable to attacks such as PAIR and combined attacks, while defense mechanisms such as re-tokenization and LLM-based detectors can provide enhanced protection; (2) robustness varies substantially across prompt templates (up to 40%); (3) deploying RobustJudge on Alibaba's PAI platform uncovers previously undiscovered vulnerabilities. These results offer practical insights for building trustworthy LLM-as-a-Judge systems.

Paper Structure

This paper contains 32 sections, 10 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: An overview of our judge prompt's structure, composed of Component Design, Evaluation Process, and Evaluation Metrics. The bottom section demonstrates a vulnerability where a minimal text perturbation manipulates the judge's score from 9.3 to 10.0.
  • Figure 2: The framework consists of five components: (§\ref{['subsec:dataset_construction']}) Dataset Construction supporting diverse task types (text, vision, code, knowledge); (§\ref{['subsec: attacker']}) Attacker Factory implementing heuristic and optimization-based attacks; (§\ref{['subsec:defense-guard']}) Defense Guard deploying detection and prevention mechanisms; (§\ref{['subsec:llm_judges']}) LLM Judge with customizable prompts and evaluation metrics (pairwise/scoring); and (§\ref{['subsec:attack_imple']}) Result Analysis assessing attack effectiveness, defense performance, prompt impact, and real-world vulnerabilities.
  • Figure 3: Evaluation results across multiple tasks (T1–T8). The tasks include: T1 (Text Translation), T2 (Text Summarization), T3 (Code Translation), T4 (Code Generation), T5 (Code Summarization), T6 (Logical Reasoning), T7 (Mathematics), and T8 (Knowledge Recall).
  • Figure 4: Robustness evaluation of different judge models against 7 adversarial attacks. Each subplot shows the Attack Success Rate (ASR), Success Defense Rate (SDR), and Improved Success Defense Rate (iSDR) for a specific attack method, with (h) presenting the average performance across all attacks.
  • Figure 5: Impact of attack on PAI-Judge variants.