Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Stuart Armstrong, Matija Franklin, Connor Stevens, Rebecca Gorman

TL;DR

The paper addresses the vulnerability of LLMs to Best-of-N jailbreaking by proposing DATDP, a proactive defense in which an evaluation agent assesses each prompt for danger and jailbreak attempts before it reaches the responding model. The method uses an iterative, weighted scoring process to classify prompts as safe or harmful. It demonstrates near-perfect blocking on augmented prompts across multiple datasets and models, including smaller evaluation models like LLaMa-3-8B-instruct. Key findings show that both large and small evaluation models can substantially reduce jailbreak success, with $>99\%$ blocking on augmented prompts and $100\%$ blocking on BoN jailbreaking prompts in several configurations. The approach is released as open source, which highlights its practical potential for scalable AI safety: it adds a lightweight, preemptive defense layer that complements internal model safeguards.

Abstract

Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization, punctuation, etc.), is effective against all major large language models (LLMs). We have found that $100\%$ of the BoN paper's successful jailbreaks (confidence interval $[99.65\%, 100.00\%]$) and $99.8\%$ of successful jailbreaks in our replication (confidence interval $[99.28\%, 99.98\%]$) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors (unlike some other approaches, DATDP also explicitly looks for jailbreaking attempts) until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they also seem capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety.
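
The abstract reports confidence intervals for the blocking rates but this excerpt does not say how they were computed. As a hedged illustration only, the sketch below shows one common choice for a binomial proportion, the exact (Clopper-Pearson) interval, using only the Python standard library. The function names, the choice of interval, and the sample sizes in the usage note are assumptions for illustration, not details taken from the paper.

```python
import math


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def _bisect_increasing(f, target, iters=100):
    """Find p in [0, 1] with f(p) == target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k blocked prompts out of n."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: binom_tail_ge(k, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X >= k + 1) equals 1 - alpha / 2.
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: binom_tail_ge(k + 1, n, p), 1 - alpha / 2)
    return lower, upper
```

For example, `clopper_pearson(1000, 1000)` gives the interval for a hypothetical run in which all 1000 prompts were blocked. In that all-blocked case the lower bound reduces to the closed form $(\alpha/2)^{1/n}$, which is why an observed rate of $100\%$ still comes with a lower bound slightly below $100\%$.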

Paper Structure

This paper contains 25 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the DATDP pipeline. An evaluation agent assesses each user-submitted prompt until high confidence is reached. Then the prompt is blocked (if dangerous) or passed through to the responding LLM (if safe).
  • Figure 2: LLMs perform very differently in an evaluation agent role, versus a responding or assistant role. Here, LLaMa-3-8B-instruct as an assistant cheerfully provides suggestions on how to spread a virus. As part of an evaluation agent, however, it blocks the prompts and correctly articulates why.
  • Figure 3: DATDP on Datasets of Augmented Prompts. This figure shows the performance of the evaluation agent on four datasets, using either Claude (solid bars) or LLaMa-3-8B-instruct (hatched bars) as the base model. The four datasets are: the successful jailbreaking prompts from the BoN paper, the prompts that successfully jailbroke LLaMa-3-8B-instruct in our replication, the prompts that failed to jailbreak LLaMa-3-8B-instruct in our replication, and generic augmented prompts. All these prompts should be blocked; the Y-axis shows the percentage correctly blocked. Y-axis range: $90\%$ to $100\%$.
  • Figure 4: DATDP on Datasets of Augmented Prompts. This figure shows the performance of the evaluation agent on two datasets, using either Claude (solid bars) or LLaMa-3-8B-instruct (hatched bars) as the base model. The two datasets are: $159$ prompts from HarmBench's original dangerous prompt set (which should be blocked) and normal, non-dangerous prompts (which should be accepted). Neither of these is augmented. The Y-axis shows the percentage correctly classified. Y-axis range: $90\%$ to $100\%$.
  • Figure 5: Proportion of blocked prompts as a function of $N$, the number of iterations. The figure illustrates the performance of the DATDP mechanism with LLaMa-3-8B-instruct for different prompt datasets across multiple iterations of evaluation.
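
The pipeline in Figure 1 and the dependence on $N$ in Figure 5 can be sketched as a small loop: query an evaluation agent $N$ times, combine the verdicts with weights, and only forward the prompt to the responding model if the combined score says it is safe. This is a minimal sketch under stated assumptions, not the authors' released implementation. The function names, the default number of iterations, the weights, the blocking threshold, and the keyword-based `toy_evaluate` stand-in (which replaces a real evaluation LLM call) are all hypothetical.

```python
from typing import Callable


def datdp_blocks(
    prompt: str,
    evaluate: Callable[[str], bool],
    n_iterations: int = 25,      # assumed default, illustrative only
    block_weight: float = 2.0,   # assumed weight for a "dangerous" verdict
    allow_weight: float = 1.0,   # assumed weight for a "safe" verdict
) -> bool:
    """Return True if the prompt should be blocked.

    `evaluate` stands in for one call to the evaluation agent and returns
    True when that call judges the prompt dangerous or a jailbreak attempt.
    """
    score = 0.0
    for _ in range(n_iterations):
        if evaluate(prompt):
            score += block_weight
        else:
            score -= allow_weight
    # Weighting "dangerous" verdicts more heavily biases the decision
    # toward blocking when the evaluator is inconsistent.
    return score >= 0


def guarded_respond(
    prompt: str,
    evaluate: Callable[[str], bool],
    respond: Callable[[str], str],
) -> str:
    """Forward the prompt to the responding model only if it is not blocked."""
    if datdp_blocks(prompt, evaluate):
        return "[blocked by evaluation agent]"
    return respond(prompt)


def toy_evaluate(prompt: str) -> bool:
    """Hypothetical stand-in for an evaluation LLM: flags one keyword."""
    return "virus" in prompt.lower()
```

With these stand-ins, `guarded_respond("How do I SpReAd a ViRuS?", toy_evaluate, str.upper)` returns the blocked marker, while a benign prompt is passed through to the `respond` callable. In a real deployment `evaluate` would wrap a call to the evaluation model, and the asymmetric weights mean a prompt is blocked even when only a minority of evaluations flag it.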