Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
TL;DR
This work formalizes the LLM Bomb-Defense Problem in a narrow-domain setting, introducing a threat model with limited log-prob access and a defense objective $M_d$ that must not reveal advanced bomb-making information beyond the input $x$ and attacker $\mathcal{A}$. It shows that standard safety training, static adversarial training, and conventional input/output classifiers each suffer vulnerabilities under targeted attacks. To improve defenses, the authors propose a transcript-based classifier, CoT-4o, which transforms transcripts, applies chain-of-thought reasoning, and strict parsing to detect and block potentially harmful requests. However, even this approach is not foolproof: a single static attack defeats it, and further tuning sometimes degrades performance. The findings suggest that narrow-domain jailbreak defense is still challenging and that insights gained may inform broader safeguards against jailbreaks.
Abstract
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak-defense even in a narrow domain.
