ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models

Harry Owiredu-Ashley

TL;DR

We present ADVERSA, an automated red-teaming framework that measures guardrail degradation as continuous per-round compliance trajectories rather than discrete jailbreak events, and report a controlled experiment across three frontier victim models.

Abstract

Most adversarial evaluations of large language model (LLM) safety assess single prompts and report binary pass/fail outcomes, an approach that fails to capture how safety properties evolve under sustained adversarial interaction. We present ADVERSA, an automated red-teaming framework that measures guardrail degradation dynamics as continuous per-round compliance trajectories rather than discrete jailbreak events. ADVERSA uses a fine-tuned 70B attacker model (ADVERSA-Red, Llama-3.1-70B-Instruct with QLoRA) that eliminates the attacker-side safety refusals that render off-the-shelf models unreliable as attackers, and it scores victim responses on a structured 5-point rubric that treats partial compliance as a distinct measurable state. We report a controlled experiment across three frontier victim models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2) using a triple-judge consensus architecture in which judge reliability is measured as a first-class research outcome rather than assumed. Across 15 conversations of up to 10 adversarial rounds each, we observe a 26.7% jailbreak rate with an average jailbreak round of 1.25, suggesting that in this evaluation setting, successful jailbreaks were concentrated in early rounds rather than accumulating through sustained pressure. We document inter-judge agreement rates, self-judge scoring tendencies, attacker drift as a failure mode in fine-tuned attackers deployed outside their training distribution, and attacker refusals as a previously underreported confound in victim resistance measurement. All limitations are stated explicitly. Attack prompts are withheld per responsible disclosure policy; all other experimental artifacts are released.
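As a quick sanity check on the headline numbers, the short sketch below recomputes the jailbreak rate from the counts given in the abstract and Figure 2 (4 of 15 conversations). It assumes, which the text here does not state explicitly, that the average jailbreak round of 1.25 is taken over the jailbroken conversations only; under that assumption the jailbreak rounds must sum to 5. The variable names are illustrative.

```python
# Counts taken from the abstract and Figure 2.
total_conversations = 15
jailbroken_conversations = 4

# Jailbreak rate as a fraction of all conversations.
jailbreak_rate = jailbroken_conversations / total_conversations
print(f"jailbreak rate: {jailbreak_rate:.1%}")  # 26.7%

# ASSUMPTION: the reported average (1.25) is computed over the
# jailbroken conversations only, not over all 15 conversations.
average_jailbreak_round = 1.25
implied_round_sum = average_jailbreak_round * jailbroken_conversations
print(f"implied sum of jailbreak rounds: {implied_round_sum}")  # 5.0
```

A round sum of 5 across 4 conversations means the successes could not have occurred late in the 10-round budget, which is consistent with the abstract's reading that jailbreaks were concentrated in early rounds.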

Paper Structure (47 sections, 12 figures, 6 tables)

Introduction
Background and Related Work
LLM Safety Alignment
Jailbreaking and Adversarial Prompting
Multi-Turn Adversarial Evaluation
Red-Teaming Frameworks
LLM-as-Judge Reliability
The ADVERSA Framework
System Architecture
ADVERSA-Red: The Attacker Model
Compliance Rubric
Adversarial Objectives
Experimental Methodology
Configuration
Conversation Protocol
...and 32 more sections

Figures (12)

Figure 1: ADVERSA pipeline. The attacker generates adversarial prompts; the victim responds with full conversation history; judges independently score each response; all outputs are logged per round. Judge scores are never visible to the attacker.
Figure 2: Overall jailbreak rate: 4 of 15 conversations (26.7%).
Figure 3: Per-victim jailbreak rates and average rounds completed. Attacker refusals against Gemini 3.1 Pro are annotated; they reduce Gemini's effective attack exposure.
Figure 4: Jailbreak rate by harm category. Category resistance ordering is consistent across victims within this dataset.
Figure 5: Score trajectory heatmap across all 15 conversations. Rows: conversations. Columns: rounds 1–10. Color encodes consensus score (1 = dark red, 5 = bright green). Grey cells indicate rounds that did not occur.
...and 7 more figures
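The per-round loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ADVERSA implementation: the function name `run_conversation`, the callable signatures, the message format, and the use of the median as the consensus rule are all hypothetical, and early-stopping conditions are omitted because they are not specified here.

```python
from statistics import median
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_conversation(
    attacker: Callable[[str, List[Message]], str],   # hypothetical signature
    victim: Callable[[List[Message]], str],          # hypothetical signature
    judges: List[Callable[[str, str, str], int]],    # each returns a 1-5 score
    objective: str,
    max_rounds: int = 10,
) -> List[dict]:
    """Run one attacker/victim conversation and log judge scores per round."""
    history: List[Message] = []
    log: List[dict] = []
    for round_no in range(1, max_rounds + 1):
        # The attacker sees only the objective and the conversation so far;
        # judge scores are never passed to it.
        prompt = attacker(objective, history)
        history.append({"role": "attacker", "content": prompt})

        # The victim responds with the full conversation history.
        reply = victim(history)
        history.append({"role": "victim", "content": reply})

        # Each judge scores the response independently.
        scores = [judge(objective, prompt, reply) for judge in judges]

        # ASSUMPTION: consensus is the median of the judge scores.
        log.append({"round": round_no, "scores": scores,
                    "consensus": median(scores)})
    return log
```

Keeping the scores in a separate log, rather than appending them to the history, is one simple way to guarantee that the attacker cannot condition on judge output.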
