Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu

TL;DR

JAILBREAK FOUNDRY (JBF) is a system that closes the gap between fast-moving jailbreak techniques and slower-moving benchmarks. It uses a multi-agent workflow to translate jailbreak papers into executable modules that can be evaluated immediately within a unified harness, offering a scalable way to build living benchmarks that keep pace with the rapidly shifting security landscape.

Abstract

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.
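As a rough illustration of the fidelity metric quoted in the abstract, the mean (reproduced minus reported) ASR deviation can be read as a signed average over attacks. The sketch below is not taken from JBF: the function name `mean_asr_deviation` and the toy numbers are hypothetical and only show the arithmetic.

```python
from statistics import mean


def mean_asr_deviation(reproduced, reported):
    """Signed mean of (reproduced - reported) ASR, in percentage points.

    Hypothetical helper for illustration; not part of JBF.
    """
    if len(reproduced) != len(reported) or not reproduced:
        raise ValueError("need two non-empty lists of equal length")
    return mean(r - p for r, p in zip(reproduced, reported))


# Made-up ASR values (%) for three attacks; these are NOT results from the paper.
reproduced = [62.0, 88.5, 41.0]
reported = [60.0, 90.0, 40.5]
print(mean_asr_deviation(reproduced, reported))
```

Because the average is signed, positive and negative per-attack gaps can cancel, so a small mean does not by itself rule out large deviations on individual attacks.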

Paper Structure (73 sections, 3 equations, 10 figures, 2 tables, 1 algorithm)

Figures (10)

  • Figure 1: Jailbreak Foundry (JBF) overview. JBF-Lib provides shared contracts and utilities, JBF-Forge translates papers into runnable modules, and JBF-Eval evaluates them with fixed datasets, protocols, and judging, enabling comparable cross-attack and cross-model results.
  • Figure 2: With-repo vs. no-repo reproduction on five selected attacks (recent methods, largest gains in Table \ref{tab:asr_by_dataset_compact_resized_timeline}, and one older baseline). Bars show ASR (%) using paper text only vs. paper + official runnable repo (when available).
  • Figure 3: ASR (%) for six attacks with the largest baseline reproduction gaps.
  • Figure 4: JBF-Eval ASR heatmap on AdvBench: standardized attack success rates (%) for $30$ attacks (x-axis) across $10$ victim models (y-axis) under a unified harness and judge; warmer colors indicate higher ASR.
  • Figure 5: Implementation iterations vs. attack performance. Each point denotes an attack (or variant) with its mean ASR averaged over 10 victim models; colors indicate the number of implementation--audit iterations required to reach the auditor's acceptance. Diamonds and vertical bars report the group mean $\pm$ std. The dashed line is a least-squares trend, showing a weak and non-significant association between iteration count and ASR ($r=0.186$, $p=0.325$).
  • ...and 5 more figures
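
The Figure 5 caption reports a least-squares trend and a correlation coefficient between iteration count and mean ASR. A minimal sketch of how such quantities are typically computed is shown below. The helper names and the toy data are hypothetical, and nothing here is the authors' analysis code.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (mean_x, mean_y, Sxy, Sxx, Syy) for paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return mx, my, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares fit y = a*x + b."""
    mx, my, sxy, sxx, _ = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up (iterations, mean ASR %) pairs; NOT data from the paper.
iterations = [1, 1, 2, 3, 3]
mean_asr = [40.0, 55.0, 48.0, 60.0, 52.0]
print(pearson_r(iterations, mean_asr))
print(least_squares_line(iterations, mean_asr))
```

This sketch does not compute a p-value. A library routine such as `scipy.stats.pearsonr` returns both the coefficient and a p-value if significance is needed.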