Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez

TL;DR

This work formalizes the LLM Bomb-Defense Problem in a narrow-domain setting. It introduces a threat model with limited log-prob access and a defense $M_d$ that must not reveal advanced bomb-making information beyond what the input $x$ and the attacker $\mathcal{A}$ already supply. It shows that standard safety training, static adversarial training, and conventional input/output classifiers are each vulnerable to targeted attacks. To improve on these defenses, the authors propose a transcript-based classifier, CoT-4o, which transforms the transcript, applies chain-of-thought reasoning, and strictly parses the result to detect and block potentially harmful requests. Even this approach is not foolproof: a single static attack defeats it, and further tuning sometimes degrades performance. The findings suggest that jailbreak defense remains challenging even in a narrow domain, and that the insights gained may inform broader safeguards against jailbreaks.

Abstract

Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.

Paper Structure

This paper contains 74 sections, 1 equation, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1.1: Our transcript classifier defense. (1) Generate transcript: transform user requests and assistant responses into capitalized, XML-tagged text with UUIDs to prevent prompt injection. (2) Transcript classifier: employ an LLM with chain-of-thought reasoning and a single prompt to evaluate potential requests for harmful information, starting with a manipulation check, then identifying dangerous inquiries, and finally assessing responses for inadvertent risks. (3) Parsing and judgment: explicitly validate each reasoning step; a parsing failure or a 'yes' on any checklist item causes the system to block the output and issue a refusal to the user.
  • Figure C.1: Histogram of examples that the PAIR algorithm finds for each classifier as a function of the probability that it is harmful. The red dotted line shows the threshold of 5% AlpacaEval Refusal Rate, so examples to the left of this were manually checked to see if they were competent failures.
  • Figure C.2: Egregious word histogram for each generation model on which we ran the PAIR algorithm, for between 10 and 30 occurrences. We sample 10 transcripts from this distribution to grade manually.
  • Figure C.3: Egregious word histogram for examples not flagged by our CoT-4o classifier. These were manually checked to try and find false negatives, but none were found.
  • Figure D.1: The effectiveness of a grey-box adversarial suffix attack on chain-of-thought (CoT) classifiers compared to non-CoT classifiers. In this experiment, CoT-4o uses gpt-3.5-turbo-0125 to save on compute costs. Using CoT demonstrates significantly greater resilience to suffix attacks. We drop points with an incomplete CoT, since these cause spikes in probability that do not correspond to jailbreaks (see Appendix \ref{app:rs-spikes}).
  • ...and 1 more figure
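
The three stages in the Figure 1.1 caption can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the checklist item names, the tag format, the `classify` callable (a stand-in for the LLM call), and the refusal text are all hypothetical. It shows the two ideas the caption names: wrapping upper-cased turns in tags that carry a fresh UUID, so that user text cannot reproduce the delimiters, and failing closed whenever the classifier output does not parse.

```python
import re
import uuid
from typing import Callable

# Hypothetical checklist item names; the paper's actual prompt is not shown here.
CHECKLIST = ("manipulation", "dangerous_request", "risky_response")
REFUSAL = "Sorry, I can't help with that."  # assumed refusal text


def build_transcript(user_msg: str, assistant_msg: str) -> str:
    """Upper-case both turns and wrap them in tags that carry a fresh UUID."""
    tag_id = uuid.uuid4().hex  # lower-case hex, unpredictable to the user
    return (
        f"<user-{tag_id}>{user_msg.upper()}</user-{tag_id}>\n"
        f"<assistant-{tag_id}>{assistant_msg.upper()}</assistant-{tag_id}>"
    )


def should_block(classifier_output: str) -> bool:
    """Strict parse: block on any 'yes' or on any missing or repeated item."""
    for item in CHECKLIST:
        answers = re.findall(
            rf"^{item}:[ \t]*(yes|no)[ \t]*$",
            classifier_output,
            flags=re.MULTILINE | re.IGNORECASE,
        )
        if len(answers) != 1:  # parsing failure, so fail closed
            return True
        if answers[0].lower() == "yes":
            return True
    return False


def guarded_reply(
    user_msg: str, assistant_msg: str, classify: Callable[[str], str]
) -> str:
    """Return the assistant response only if the classifier clears it."""
    transcript = build_transcript(user_msg, assistant_msg)
    return REFUSAL if should_block(classify(transcript)) else assistant_msg
```

Passing a stub such as `lambda t: "manipulation: no\ndangerous_request: no\nrisky_response: no"` as `classify` lets the response through, while an empty or malformed classifier output yields the refusal. Treating unparseable output as a block is the conservative choice the caption describes; it trades some false refusals for robustness to attacks that derail the classifier's reasoning.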

Theorems & Definitions (1)

  • Definition 1: LLM Bomb-Defense Problem