Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran

Abstract

Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms (optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization) across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs--success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be more compute-efficient than optimization-based methods. To explain this gap, we cast prompt-based updates into an optimization view and show via a same-state comparison that prompt-based attacks optimize more effectively in prompt space. We also show that attacks occupy distinct success--stealthiness operating points, with prompting-based methods occupying the high-success, high-stealth region. Finally, we find that vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than other harms.

Paper Structure (41 sections, 12 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Compute-normalized scaling curves for jailbreak success. ASR (judge-based average red-team score) vs. attack compute (FLOPs) on Llama-3.1-8B-Instruct; dots and shaded lines denote empirical scores; solid lines denote fitted saturating exponentials (Eq. \ref{eq:exp_fit}).
  • Figure 2: Compute-normalized scaling curves. Average red-team score (left) and relevance score (right) vs. attack compute (FLOPs) on Llama-3.1-8B-Instruct. Dots and shaded lines denote empirical scores; solid lines denote fitted saturating exponentials (Eq. \ref{eq:exp_fit}).
  • Figure 3: ASR--stealthiness operating points. Attacks occupy distinct operating points in the asymptotic ceiling ASR vs. stealthiness space (higher is better for both axes).
  • Figure 4: Goal-category scaling. ASR vs. compute budget (TFLOPs) decomposed by goal category (defined in Section \ref{sec:datasets-tasks}), showing heterogeneous scaling including saturation and occasional non-monotonic regimes.
  • Figure 5: FLOPs-aligned scaling under compute alignment (PAIR). Within-family size scaling (left) and cross-family generalization (right).
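
The saturating-exponential fits mentioned in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. The exact parametrization of the paper's fitting equation is not reproduced in this listing, so the form y = a(1 - exp(-b x)) used here is an assumption, as are the function names, the grid-search fitting procedure, and the "compute to reach a fraction of the ceiling" summary. The data are synthetic and are not results from the paper.

```python
import math

def saturating_exp(x, a, b):
    """Assumed curve: score rises from 0 toward ceiling `a` at rate `b`."""
    return a * (1.0 - math.exp(-b * x))

def fit_saturating_exp(xs, ys, b_grid):
    """Least-squares fit via grid search over b.

    For a fixed b the model is linear in a, so the best a has a closed form.
    Returns (a, b, sum_of_squared_errors) for the best grid point.
    """
    best = None
    for b in b_grid:
        g = [1.0 - math.exp(-b * x) for x in xs]
        denom = sum(gi * gi for gi in g)
        if denom == 0.0:
            continue  # degenerate basis, skip this rate
        a = sum(gi * yi for gi, yi in zip(g, ys)) / denom
        sse = sum((a * gi - yi) ** 2 for gi, yi in zip(g, ys))
        if best is None or sse < best[2]:
            best = (a, b, sse)
    return best

def compute_to_reach(fraction, b):
    """Hypothetical efficiency summary: compute needed to reach
    `fraction` of the ceiling under the assumed curve."""
    return -math.log(1.0 - fraction) / b

# Synthetic trajectory (made-up values, not taken from the paper).
true_a, true_b = 0.8, 0.5
xs = [0.5 * i for i in range(1, 21)]   # compute in arbitrary units
ys = [saturating_exp(x, true_a, true_b) for x in xs]

# Log-spaced candidate rates from 1e-3 to 1e2.
b_grid = [10 ** (-3 + 0.01 * k) for k in range(501)]
a_hat, b_hat, sse = fit_saturating_exp(xs, ys, b_grid)

print(f"ceiling={a_hat:.3f} rate={b_hat:.3f} sse={sse:.2e}")
print(f"compute for 90% of ceiling: {compute_to_reach(0.9, b_hat):.2f}")
```

Under this assumed form, the fitted ceiling `a` plays the role of an asymptotic success level and `b` controls how quickly success saturates with compute, which is what makes curves from different attacks comparable on one FLOPs axis. In practice a nonlinear least-squares routine would likely replace the grid search; the grid is used here only to keep the sketch dependency-free.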