Capability-Based Scaling Laws for LLM Red-Teaming
Alexander Panfilov, Paul Kassianik, Maksym Andriushchenko, Jonas Geiping
TL;DR
The paper investigates how red-teaming success against LLMs scales with attacker and target capabilities by framing it as a capability-gap problem. Using two LLM-based jailbreaks (PAIR and Crescendo) across 500+ attacker-target pairs, with HarmBench for evaluation, it shows that attack success scales linearly with attacker capability but declines along a sigmoid as the target's capability surpasses the attacker's, which yields a capability-gap scaling law. It finds that social-science competencies (as captured by MMLU-Pro splits) are stronger predictors of attack success than STEM knowledge, and it provides both Bayesian and bootstrap methods to quantify uncertainty in the scaling law. The results imply that fixed-capability red-teamers (e.g., humans) may become less effective as models advance, while increasingly capable open-source models pose growing risks, underscoring the need for capability-aware safety evaluation and scalable automated red-teaming in deployment.
Abstract
As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 500 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these trends, we derive a jailbreaking scaling law that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.
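As a rough illustration of what a capability-gap scaling law of this kind could look like in practice, the sketch below fits a sigmoid to attack success rate (ASR) as a function of the attacker-target gap and attaches a bootstrap interval to the slope. It rests on assumptions not stated in the text above: the gap is taken as the difference of logit-transformed benchmark scores, ASR is assumed linear in the gap in logit space, and all data are synthetic. The function names (`capability_gap`, `fit_scaling_law`, `bootstrap_slope`) are hypothetical and are not the authors' code.

```python
import numpy as np

def logit(p):
    # Clip to avoid infinities at exactly 0 or 1.
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def capability_gap(attacker_score, target_score):
    # ASSUMPTION: gap = logit(attacker score) - logit(target score),
    # with scores given as benchmark accuracies in (0, 1).
    return logit(attacker_score) - logit(target_score)

def fit_scaling_law(gap, asr):
    # ASSUMPTION: logit(ASR) = k * gap + b, fitted by least squares.
    k, b = np.polyfit(gap, logit(asr), deg=1)
    return k, b

def predict_asr(gap, k, b):
    # Map the linear fit back to a success rate in (0, 1).
    return sigmoid(k * gap + b)

def bootstrap_slope(gap, asr, n_boot=500, seed=0):
    # Resample (gap, ASR) pairs with replacement and refit each time;
    # return a 95% percentile interval for the slope k.
    rng = np.random.default_rng(seed)
    n = len(gap)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        slopes[i], _ = fit_scaling_law(gap[idx], asr[idx])
    return np.percentile(slopes, [2.5, 97.5])

# Synthetic data only: 40 hypothetical attackers against one fixed target.
rng = np.random.default_rng(42)
attacker = rng.uniform(0.2, 0.8, size=40)
target = np.full(40, 0.5)
gap = capability_gap(attacker, target)
true_k, true_b = 2.0, -0.5
asr = sigmoid(true_k * gap + true_b + rng.normal(0.0, 0.2, size=40))

k, b = fit_scaling_law(gap, asr)
ci_low, ci_high = bootstrap_slope(gap, asr)
print(f"slope k = {k:.2f}, intercept b = {b:.2f}")
print(f"95% bootstrap interval for k: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"predicted ASR at gap -1: {predict_asr(-1.0, k, b):.2f}")
print(f"predicted ASR at gap +1: {predict_asr(1.0, k, b):.2f}")
```

Under this parameterization a positive slope means predicted ASR falls as the target pulls ahead of the attacker (negative gap) and saturates when the attacker is far stronger, which matches the qualitative trend described above.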
