A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann

TL;DR

Existing validation protocols for LLM judges fail to account for the substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios.

Abstract

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: https://github.com/SchwinnL/LLMJudgeReliability.
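To make the two quantities in the abstract concrete, the following is a minimal sketch of how judge accuracy and Attack Success Rate (ASR) could be compared against human labels. It is an illustration only, not the authors' evaluation code: the function name `judge_report`, its parameters, and the toy data are hypothetical. The one detail taken from the paper is the threshold, where human scores larger than 2 count as harmful (see the caption of Figure 2).

```python
from typing import Dict, List


def judge_report(
    human_scores: List[int],
    judge_flags: List[bool],
    harmful_above: int = 2,
) -> Dict[str, float]:
    """Compare binary judge verdicts against human harmfulness scores.

    human_scores: per-sample human rating (hypothetical scale; a score
                  larger than `harmful_above` is treated as harmful).
    judge_flags:  per-sample judge verdict, True meaning "harmful".
    """
    if len(human_scores) != len(judge_flags) or not human_scores:
        raise ValueError("inputs must be non-empty and of equal length")

    # Binarize the human scores with the threshold from the Figure 2 caption.
    human_flags = [score > harmful_above for score in human_scores]
    n = len(human_flags)

    # Fraction of samples where judge and human agree.
    accuracy = sum(h == j for h, j in zip(human_flags, judge_flags)) / n
    # ASR as a human would report it versus as the judge reports it.
    asr_human = sum(human_flags) / n
    asr_judge = sum(judge_flags) / n

    return {
        "accuracy": accuracy,
        "asr_human": asr_human,
        "asr_judge": asr_judge,
        "asr_inflation": asr_judge - asr_human,
    }


# Toy data, invented for illustration; not taken from the paper's dataset.
report = judge_report(
    human_scores=[1, 1, 2, 3, 5, 1, 2, 4],
    judge_flags=[True, True, False, True, True, False, True, False],
)
print(report)
```

On this invented toy data the judge agrees with the human labels on half of the samples while reporting a higher ASR than the human labels support, which is the kind of gap the abstract describes.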

Paper Structure (26 sections, 2 equations, 15 figures, 1 table)


Figures (15)

  • Figure 1: Impact of judge unreliability on reported Attack Success Rates (ASR). Standard automated judges significantly overestimate the success of adversarial attacks compared to human verification.
  • Figure 2: Statistics of the labeled dataset. The panels show the distributions of (a) scores given by human labelers, where scores larger than $2$ indicate harmfulness, (b) judged samples per behavior, (c) samples labeled per attack, and (d) samples labeled per model.
  • Figure 3: Judge accuracy across attacks for Llama-3.1-8B and Gemma-27-B. Darker colors indicate lower accuracy.
  • Figure 4: Average judge accuracy across different distribution shifts: (a) Attack, (b) Model, (c) Behavior, (d) Semantic Category.
  • Figure 5: ROC curves and AUROC scores for JailJudge on generations from the Llama-3.1-8B model.
  • ...and 10 more figures