Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

Samuel Nathanson, Rebecca Williams, Cynthia Matuszek

TL;DR

The paper investigates how adversarial vulnerabilities scale when LLMs interact, simulating more than 6,000 attacker–target exchanges across model sizes from $0.6$B to $120$B parameters using JailbreakBench prompts. It adopts a three-role setup (attacker, target, judge) with judge-based harm scoring ($1$–$5$) and automatic attacker refusal detection to quantify safety gaps. The key finding is a positive relationship between the logarithm of the attacker-to-target size ratio and mean harm (Pearson $r=0.510$, Spearman $\rho=0.519$, $p<0.001$). Mean harm varies more across attackers (variance $0.180$) than across targets ($0.097$), and attacker refusal rate is strongly negatively correlated with harm ($\rho=-0.927$, $p<0.001$). These results suggest emergent scaling patterns in adversarial alignment, implying that robustness in multi-LLM systems depends on the relative capabilities and alignment of all participating agents, not just on individual target safeguards. The work provides a foundation for causal, training-controlled studies and longer-horizon multi-agent investigations to inform scalable safety strategies.

Abstract

Large language models (LLMs) increasingly operate in multi-agent and safety-critical settings, raising open questions about how their vulnerabilities scale when models interact adversarially. This study examines whether larger models can systematically jailbreak smaller ones - eliciting harmful or restricted behavior despite alignment safeguards. Using standardized adversarial tasks from JailbreakBench, we simulate over 6,000 multi-turn attacker-target exchanges across major LLM families and scales (0.6B-120B parameters), measuring both harm score and refusal behavior as indicators of adversarial potency and alignment integrity. Each interaction is evaluated through aggregated harm and refusal scores assigned by three independent LLM judges, providing a consistent, model-based measure of adversarial outcomes. Aggregating results across prompts, we find a strong and statistically significant correlation between mean harm and the logarithm of the attacker-to-target size ratio (Pearson r = 0.51, p < 0.001; Spearman rho = 0.52, p < 0.001), indicating that relative model size correlates with the likelihood and severity of harmful completions. Mean harm score variance is higher across attackers (0.18) than across targets (0.10), suggesting that attacker-side behavioral diversity contributes more to adversarial outcomes than target susceptibility. Attacker refusal frequency is strongly and negatively correlated with harm (rho = -0.93, p < 0.001), showing that attacker-side alignment mitigates harmful responses. These findings reveal that size asymmetry influences robustness and provide exploratory evidence for adversarial scaling patterns, motivating more controlled investigations into inter-model alignment and safety.
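The headline statistic relates mean harm to the logarithm of the attacker-to-target size ratio. The following is a minimal sketch of that kind of computation, not the authors' code: the `pairs` data are invented for illustration, the helper names are hypothetical, and base-10 logarithms are an assumption (the abstract does not state the base, and Pearson and Spearman coefficients do not depend on it).

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical (attacker size in B params, target size in B params, mean harm)
# triples, one per attacker-target pair. These are NOT the paper's data.
pairs = [
    (0.6, 120.0, 1.2),
    (8.0, 70.0, 1.6),
    (8.0, 8.0, 2.1),
    (70.0, 8.0, 2.9),
    (120.0, 0.6, 3.4),
]

log_ratio = [math.log10(attacker / target) for attacker, target, _ in pairs]
harm = [h for _, _, h in pairs]

r = pearson(log_ratio, harm)
rho = spearman(log_ratio, harm)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Taking the logarithm makes the ratio symmetric around zero: an attacker ten times larger than its target and one ten times smaller sit at equal distances on opposite sides, which suits a linear correlation better than the raw ratio.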

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Relationship between mean harm and the logarithm of the attacker-to-target model size ratio. Each point represents an attacker–target pair averaged across prompts. Correlations are Pearson $r=0.510$ and Spearman $\rho=0.519$, illustrating a consistent positive correlation between relative model scale and harm.
  • Figure 2: Ridgeline distributions of attacker-to-target size ratios (log scale) stratified by rounded discrete harm levels. Higher harm strata shift toward larger relative attacker sizes, showing that severe jailbreaks are more likely when attackers substantially exceed target scale. Harm dispersion appears similar across attack domains, with stronger differences at harm = 5, especially for physical harm.
  • Figure 3: Attacker refusal rates by model size. Refusal frequency decreases with model scale, reflecting stronger adversarial persistence in larger models. The strong negative correlation between refusal rate and mean harm ($\rho=-0.927$, $p<0.001$) indicates that refusal behavior is a key protective mechanism against harmful outputs.
  • Figure 4: Combined heatmap and variance plots showing mean harm scores across attacker (columns) and target (rows) models. Bar charts indicate attacker- and target-side harm variance ($0.180$ and $0.097$), highlighting greater dispersion among attackers than targets.
  • Figure 5: Scatter matrix of harm distributions across all attacker–target pairs (attacker refusals excluded). Each cell shows individual run outcomes (harm 1–5) for a given pairing, with median lines and interquartile bands summarizing variability. This visualization highlights cross-family consistency and the broader scaling pattern linking attacker size to harm intensity.
  • ...and 1 more figure
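The attacker- and target-side variances reported for Figure 4 can be illustrated with a small sketch. It assumes, since the summary does not say how the variances were computed, that each is the variance of per-model mean harm: column means for attackers, row means for targets. The matrix values and model names are hypothetical, and population variance is an arbitrary choice here.

```python
from statistics import mean, pvariance

# Hypothetical mean-harm matrix laid out like the Figure 4 heatmap:
# rows are targets, columns are attackers. These are NOT the paper's values.
attackers = ["att_small", "att_mid", "att_large"]
targets = ["tgt_small", "tgt_large"]
mean_harm = [
    [1.5, 2.5, 3.5],  # tgt_small
    [1.0, 2.0, 3.0],  # tgt_large
]

# Average each attacker's column over targets, and each target's row over attackers.
per_attacker = [mean(row[j] for row in mean_harm) for j in range(len(attackers))]
per_target = [mean(row) for row in mean_harm]

# Assumption: population variance of the per-model means.
attacker_var = pvariance(per_attacker)
target_var = pvariance(per_target)

print(f"attacker-side variance = {attacker_var:.3f}")
print(f"target-side variance   = {target_var:.3f}")
```

In this toy matrix harm changes far more along the attacker axis than along the target axis, so the attacker-side variance is the larger of the two, the same qualitative pattern the caption describes.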