The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr

TL;DR

The Jailbreak Tax introduces a rigorous, ground-truth framework to evaluate the usefulness of outputs from jailbroken LLMs, defining the jailbreak tax as the loss of baseline utility when safety guardrails are bypassed. By constructing objective tasks in biology and mathematics and applying multiple alignment and jailbreak techniques, the authors show that bypassing guardrails often incurs substantial utility loss, with the tax varying widely across attacks and model sizes. The work provides five benchmark suites and formal metrics, revealing that higher jailbreak success does not guarantee higher utility recovery, and that alignment types influence utility differently. This approach offers a practical safety lens for evaluating jailbreaks and highlights the need for community benchmarks and access to unaligned models for rigorous safety assessment.

Abstract

Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax
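
The metrics described above can be illustrated with a small sketch. It assumes the jailbreak tax is the relative drop from the baseline accuracy of the unaligned model to the accuracy of the jailbroken model on the questions it actually answers; this is one reading of "loss of baseline utility", not a definition confirmed in this section. The names `Response`, `jail_succ`, `jail_util`, and `jailbreak_tax` are hypothetical and are not taken from the benchmark repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    """One jailbroken model response, graded against ground truth (hypothetical schema)."""
    refused: bool   # True if the model still declined to answer
    correct: bool   # True if the answer matches the known ground truth


def jail_succ(responses: List[Response]) -> float:
    """Fraction of questions for which the jailbreak bypassed the refusal."""
    if not responses:
        raise ValueError("no responses to evaluate")
    return sum(not r.refused for r in responses) / len(responses)


def jail_util(responses: List[Response]) -> float:
    """Accuracy measured only on the responses that were not refused."""
    answered = [r for r in responses if not r.refused]
    if not answered:
        return 0.0  # assumption: no successful jailbreaks means no recovered utility
    return sum(r.correct for r in answered) / len(answered)


def jailbreak_tax(base_util: float, jailbroken_util: float) -> float:
    """Relative utility loss (assumed form): (base - jailbroken) / base."""
    if base_util <= 0:
        raise ValueError("baseline utility must be positive")
    return (base_util - jailbroken_util) / base_util


# Toy data: four questions, one still refused, one answered correctly.
toy = [
    Response(refused=False, correct=True),
    Response(refused=False, correct=False),
    Response(refused=True, correct=False),
    Response(refused=False, correct=False),
]
success = jail_succ(toy)                      # 3 of 4 questions answered
utility = jail_util(toy)                      # 1 of 3 answered questions correct
tax = jailbreak_tax(0.9, utility)             # 0.9 is an illustrative baseline accuracy
```

The toy data shows the two quantities coming apart: the jailbreak succeeds on three of four questions, yet most of the baseline accuracy is lost, which is the pattern the abstract describes.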

Paper Structure

This paper contains 38 sections, 7 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Illustration of our results. We align a LLaMA 3.1 70B model to refuse questions on bio-security (WMDP) and math (GSM8K and MATH). After being jailbroken, the model responds to questions, but some attacks incur a significant reduction in utility (the jailbreak tax).
  • Figure 2: Overview of our framework. Left: We ask models benign questions for which correctness is easy to verify (e.g., in mathematics). Middle: We align models to refuse to answer questions on this topic. Right: We use jailbreaks to circumvent alignment, and check whether the jailbroken model responds correctly (in this case it does not). We refer to the drop in model abilities due to jailbreaks as the jailbreak tax.
  • Figure 3: Jailbreak success rate (JailSucc) and jailbreak tax (JTax) for various jailbreak attacks against a LLaMA 3.1 70B model with system prompt alignment on the WMDP (left) and GSM8K (right) datasets. The error bars show 95% confidence intervals.
  • Figure 4: Jailbreak success rate (JailSucc) and jailbreak tax (JTax) for various jailbreak attacks against a LLaMA 3.1 70B model with SFT alignment on the WMDP (left) and GSM8K (right) datasets. The error bars show 95% confidence intervals.
  • Figure 5: Jailbreak success rate (JailSucc) and jailbreak tax (JTax) for various jailbreak attacks against Claude 3.5-Haiku on the EvilMath dataset. The error bars show 95% confidence intervals.
  • ...and 6 more figures