Capability-Based Scaling Laws for LLM Red-Teaming

Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping

TL;DR

The paper investigates how red-teaming success for LLMs scales with attacker and target capabilities by framing it as a capability-gap problem. Using two LLM-based jailbreaks (PAIR and Crescendo) across 500+ attacker–target pairs and HarmBench for evaluation, it shows that attack success scales linearly with attacker capability but follows a sigmoid decline as the target surpasses the attacker, enabling a capability-gap scaling law. It reveals that social-science competencies (as captured by MMLU-Pro splits) are stronger predictors of attack success than STEM knowledge, and it provides both Bayesian and bootstrap methods to model and quantify uncertainty in the scaling law. The results imply that fixed-capability red-teamers (e.g., humans) may become less effective as models advance, while open-source models pose increasing risks, underscoring the need for capability-aware safety evaluation and scalable automated red-teaming in deployment.

Abstract

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

Paper Structure

This paper contains 28 sections, 16 equations, 11 figures, 6 tables, 2 algorithms.

Figures (11)

  • Figure 1: Overview of Our Contributions: (1) We evaluate over 500 attacker-target combinations with two jailbreak techniques and find that attacker success rate scales linearly with general capability (measured with MMLU-Pro scores). (2) However, for a fixed target model the attack success rate follows a sigmoid curve and can be predicted accurately from the attacker-target capability gap. (3) Using the resulting capability-based scaling law, we forecast that red-teaming for a fixed attacker, such as a human, will inevitably become less effective as target models' capabilities increase.
  • Figure 2: All Attacker-Target Combinations. We evaluate over 500 attacker-target pairs, with each heatmap cell showing the max per-pair Attack Success Rate (ASR) in eliciting unsafe behaviors (over the first 50 queries in HarmBench), aggregated across both attacks, PAIR and Crescendo. Column view: Sorted by Average Target ASR (last row), with lighter-colored columns (e.g., Llama2-13b) indicating more robust targets. Row view: Sorted by Attacker MMLU-Pro, with darker-colored rows (e.g., Qwen2.5-32b) indicating stronger attackers. From the last column, Average Attacker ASR, we observe that it increases with attacker capability. Llama3.2-1b is the least capable model and o3 (target-only) the most capable in our analysis (based on MMLU-Pro).
  • Figure 3: More Capable Models Are Stronger as Both Attackers and Targets. Left: Attacker Success Rate, averaged over all targets, increases linearly with attacker capability. Right: Target Vulnerability, defined as the max achieved per-target ASR, decreases with target capability. Models generally follow a sigmoid-like trend, with only early Llama models (Llama2 and Llama3-8b) emerging as outliers. $R^2$ is reported for each fit excluding outliers, alongside Spearman $\rho$.
  • Figure 4: Capability-Based Jailbreaking Scaling Laws. Top: Per-target scaling. For each target model we fit a linear model in logit space using the max achieved ASR of every attacker-target pair, then map predictions back to probability space; shaded bands show the $95\%$ bootstrap confidence interval. Bottom: Family-level scaling. Per-target curves from the same family are aggregated into a single scaling law, which we test on new targets that are not part of the model family. The Qwen-2.5 curve generalizes best, closely matching the closed-source state-of-the-art reasoning models.
  • Figure 5: A Forecast for Human Red-Teaming. Using the aggregated scaling law across all target models, we predict ASR for a fixed human attacker (modelled as 0.898 on MMLU-Pro). The forecast shows a continued decline as future models grow more capable and the capability gap widens. For reference, we add the highest achieved ASR with an LLM-attacker in our study.
  • ...and 6 more figures
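
The per-target fit described in the Figure 4 caption (a linear model in logit space, mapped back to probabilities, with a 95% bootstrap band) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Several things here are assumptions: the capability gap is taken to be the difference of logit-transformed MMLU-Pro scores, the fit is ordinary least squares, the band uses a percentile bootstrap over attacker-target pairs, and all function names and data values are hypothetical.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Map probabilities to log-odds, clipping to avoid infinities at 0 and 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(x):
    """Map log-odds back to probabilities."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def fit_logit_line(gap, asr):
    """Least-squares fit of logit(ASR) = k * gap + b (assumed functional form)."""
    k, b = np.polyfit(np.asarray(gap, dtype=float), logit(asr), deg=1)
    return k, b

def predict_asr(gap, k, b):
    """Evaluate the fitted line and map it back to probability space."""
    return sigmoid(k * np.asarray(gap, dtype=float) + b)

def bootstrap_band(gap, asr, grid, n_boot=1000, seed=0):
    """Percentile bootstrap band (2.5% to 97.5%) for predictions on `grid`."""
    rng = np.random.default_rng(seed)
    gap = np.asarray(gap, dtype=float)
    asr = np.asarray(asr, dtype=float)
    n = len(gap)
    preds = np.empty((n_boot, len(grid)))
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        k, b = fit_logit_line(gap[idx], asr[idx])
        preds[i] = predict_asr(grid, k, b)
    lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
    return lo, hi

# Synthetic example: the scores and ASR values below are made up for illustration.
attacker_scores = np.array([0.25, 0.35, 0.45, 0.55, 0.65, 0.70, 0.75, 0.80])
target_score = 0.60
gap = logit(attacker_scores) - logit(target_score)  # assumed gap definition
noise = np.random.default_rng(1).normal(0.0, 0.3, size=gap.shape)
asr = sigmoid(1.2 * gap - 0.4 + noise)              # synthetic "max ASR" per pair

k, b = fit_logit_line(gap, asr)
grid = np.linspace(gap.min(), gap.max(), 5)
lo, hi = bootstrap_band(gap, asr, grid, n_boot=500)
for g, p, l, h in zip(grid, predict_asr(grid, k, b), lo, hi):
    print(f"gap={g:+.2f}  ASR={p:.2f}  95% band=[{l:.2f}, {h:.2f}]")
```

Under the same assumptions, a forecast in the spirit of Figure 5 would fix the attacker score at 0.898 and evaluate `predict_asr(logit(0.898) - logit(s), k, b)` for increasing target scores `s`, using coefficients fitted on real data rather than the synthetic values above.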