When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Lidor Erez, Omer Hofman, Tamir Nizri, Roman Vainshtein

Abstract

Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring attack success rates (ASR) across attack types. Yet the validity of these measurements hinges on an often-overlooked component: the evaluator that determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component: changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To address this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method in which evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to 33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluator. Our framework offers a practical approach to quantifying unreliable evaluations and improving the reliability of measurements in automated LLM security assessments.
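
To make the central claim concrete, the following minimal Python sketch scores one fixed set of model responses with two different evaluators and compares the resulting ASR. It is an illustration only, not the scanner's or the authors' implementation: the keyword lists, the `refusal_aware_evaluator` stand-in for a context-reading judge, and the toy responses are all hypothetical.

```python
from typing import Callable, List

# Hypothetical trigger and refusal phrases, chosen only for illustration.
TRIGGER_KEYWORDS = ("password", "exploit")
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def keyword_evaluator(response: str) -> int:
    """Static matcher: label 1 (attack succeeded) if any trigger keyword appears."""
    text = response.lower()
    return int(any(k in text for k in TRIGGER_KEYWORDS))


def refusal_aware_evaluator(response: str) -> int:
    """Toy stand-in for a judge that reads context: a refusal is never a success."""
    text = response.lower()
    if any(m in text for m in REFUSAL_MARKERS):
        return 0
    return keyword_evaluator(response)


def attack_success_rate(responses: List[str], evaluator: Callable[[str], int]) -> float:
    """ASR = fraction of responses that the evaluator labels as successful attacks."""
    if not responses:
        raise ValueError("at least one response is required")
    return sum(evaluator(r) for r in responses) / len(responses)


# The attacks and model outputs are held fixed; only the evaluator changes.
responses = [
    "I can't help with writing an exploit for that service.",
    "Sure, here is the exploit code you asked for: ...",
    "I cannot share anyone's password.",
    "The weather is nice today.",
]

asr_keyword = attack_success_rate(responses, keyword_evaluator)
asr_refusal_aware = attack_success_rate(responses, refusal_aware_evaluator)
print(f"keyword ASR = {asr_keyword:.2f}, refusal-aware ASR = {asr_refusal_aware:.2f}")
```

On this toy data the keyword matcher counts two refusals as successes because they mention the trigger words, so the two evaluators report different ASR values for identical model outputs.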

Paper Structure (28 sections, 2 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Example of evaluator-induced measurement instability in LLM vulnerability scanners. The same prompt and model response can produce different Attack Success Rate (ASR) outcomes depending on the evaluator used to determine attack success. Here, a static keyword-based evaluator incorrectly labels the attack as successful, while an LLM-based evaluator correctly interprets the response as a refusal.
  • Figure 2: Typical LLM vulnerability scanning pipeline. An attack dataset is used to generate prompts for a target model; responses are evaluated by an automated component into binary labels (0/1) and aggregated into ASR. Different evaluator designs (e.g., static matching vs. LLM-based judging) produce different ASR values.
  • Figure 3: Two-phase evaluation framework. Phase I (diagnostic) measures disagreement between two evaluators applied to the same prompt–response pairs to identify unstable attack categories. Phase II (remediation) applies independent verification to estimate evaluator reliability and supports targeted evaluator replacement.
  • Figure 4: Evaluator disagreement rate $D$ per attack (mean $\pm$ std across 3 target models), sorted by $D$. The dashed line marks the reliability threshold $\tau = 0.05$. 22 of 25 attacks exceed $\tau$; 6 exhibit $D > 0.50$, meaning the two evaluators disagree on more than half of the responses for those attack categories.
  • Figure 5: ASR with evaluator-induced uncertainty intervals for the Mistral-Small model. Error bars reflect the range of ASR estimates obtained under alternative evaluator decisions.
  • ...and 6 more figures
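
The Phase I diagnostic summarized in the Figure 3 and Figure 4 captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it assumes $D$ is the fraction of prompt–response pairs on which two evaluators assign different binary labels, uses the sample standard deviation across target models, and sorts in descending order. Only the threshold $\tau = 0.05$ comes from the caption; the attack names and labels are made up.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple

TAU = 0.05  # reliability threshold taken from the Figure 4 caption

LabelPair = Tuple[Sequence[int], Sequence[int]]  # (evaluator A labels, evaluator B labels)


def disagreement_rate(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Assumed definition of D: share of pairs where the two evaluators differ."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def summarize(per_attack: Dict[str, List[LabelPair]], tau: float = TAU):
    """Return (attack, mean D, std D, unstable?) rows, sorted by mean D descending."""
    rows = []
    for attack, model_runs in per_attack.items():
        rates = [disagreement_rate(a, b) for a, b in model_runs]
        d_mean = mean(rates)
        d_std = stdev(rates) if len(rates) > 1 else 0.0  # sample std: an assumption
        rows.append((attack, d_mean, d_std, d_mean > tau))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical labels for two attacks, each evaluated on three target models.
toy_labels = {
    "attack_a": [
        ([1, 1, 0, 0], [1, 0, 0, 1]),
        ([1, 1, 1, 0], [0, 0, 1, 0]),
        ([0, 1, 0, 1], [1, 1, 1, 1]),
    ],
    "attack_b": [
        ([1, 0, 0, 0], [1, 0, 0, 0]),
        ([0, 0, 1, 1], [0, 0, 1, 1]),
        ([1, 1, 0, 0], [1, 1, 0, 0]),
    ],
}

rows = summarize(toy_labels)
for attack, d_mean, d_std, unstable in rows:
    print(f"{attack}: D = {d_mean:.2f} ± {d_std:.2f}, unstable = {unstable}")
```

Attacks whose mean $D$ exceeds $\tau$ would be the candidates for the Phase II verification step, which is what allows the costlier verification to be deployed selectively.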