Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal

TL;DR

BPJ reframes jailbreaking black-box classifier safeguards as a curriculum-driven evolutionary search in the style of continuation methods, using only a binary flag/no-flag signal and evaluating candidate attacks at boundary points. It optimises a noise-interpolated objective $f_q(a)=\mathbb{E}_{x'\sim N_{q,x}}[\mathcal{C}(ax')]$, improving attack prefixes $a$ across progressively harder targets and selecting evaluation points near the decision boundary, where small gains in attack strength are detectable. The theoretical analysis models BPJ as an evolving distribution over prefixes $p_t(a)$ under mutation and quantile-based selection, gives conditions under which progress is made and fixed points exist, and sets out a continuation framework across noise levels down to $q=0$. The work demonstrates universal jailbreaks against Constitutional Classifiers and GPT-5's input classifier and discusses defenses centred on batch-level monitoring, with practical implications for AI safety and responsible disclosure in high-stakes deployments. BPJ thus offers a principled, automated way to test and strengthen safeguards, and it underscores the need for defense strategies that go beyond single-interaction scrutiny.

Abstract

Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.
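
The abstract's closing point, that optimisation incurs many flags and so calls for batch-level monitoring, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class name `FlagRateMonitor`, the per-account sliding window, and the threshold values are all hypothetical.

```python
from collections import defaultdict, deque


class FlagRateMonitor:
    """Hypothetical batch-level monitor (not from the paper).

    Tracks classifier flags per account in a sliding time window and
    raises an alert once the count exceeds a threshold.
    """

    def __init__(self, window_seconds=3600.0, max_flags=20):
        self.window_seconds = window_seconds  # assumed window length
        self.max_flags = max_flags            # assumed alert threshold
        self._flag_times = defaultdict(deque)

    def record(self, account_id, timestamp, flagged):
        """Record one interaction; return True if the account should be reviewed."""
        times = self._flag_times[account_id]
        if flagged:
            times.append(timestamp)
        # Discard flags that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_flags


monitor = FlagRateMonitor(window_seconds=60.0, max_flags=3)
alerts = [monitor.record("acct-1", t, True) for t in range(4)]
```

A single flagged query looks unremarkable under this scheme; only the accumulated count across a batch of interactions triggers the alert, which is the distinction the abstract draws between single-interaction and batch-level defence.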

Paper Structure (56 sections, 12 theorems, 68 equations, 12 figures, 1 algorithm)

Key Result

Lemma 1.1

Let $g: \mathcal{X} \to \mathbb{R}$ be another fitness function satisfying $f(a) \ge f(a') \iff g(a) \ge g(a')$ for all $a, a' \in \mathcal{X}$. Then the weights defined by the hard quantile selection weight equation satisfy $w_\alpha(a; f, r) = w_\alpha(a; g, r)$ for any $a \in \mathcal{X}$ and distribution ...
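
The lemma says that quantile selection depends only on how the fitness function orders candidates, not on its numerical values. The sketch below checks this numerically on a small finite population. The paper's exact weight equation is not reproduced on this page, so `hard_quantile_weights` is an assumed stand-in: it gives weight 1 to a candidate when the probability mass of strictly fitter candidates under $r$ is below $\alpha$, and 0 otherwise.

```python
import math


def hard_quantile_weights(fitness, probs, alpha):
    """Assumed stand-in for a hard quantile selection weight (unnormalised).

    fitness[i] is f(a_i), probs[i] is r(a_i). A candidate is selected
    when the mass of strictly fitter candidates is below alpha.
    """
    weights = []
    for fa in fitness:
        mass_better = sum(p for fb, p in zip(fitness, probs) if fb > fa)
        weights.append(1.0 if mass_better < alpha else 0.0)
    return weights


f_values = [0.1, 0.5, 0.5, 0.9]
r_probs = [0.25, 0.25, 0.25, 0.25]

# g = exp(f) is strictly increasing in f, so it orders candidates identically.
g_values = [math.exp(v) for v in f_values]

w_f = hard_quantile_weights(f_values, r_probs, alpha=0.5)
w_g = hard_quantile_weights(g_values, r_probs, alpha=0.5)
```

Because the weight function only ever compares fitness values with each other, any order-preserving transformation leaves the weights unchanged. This is relevant in a setting where the attacker sees binary feedback and can at best rank candidates.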

Figures (12)

  • Figure 1: BPJ succeeds against Constitutional Classifiers and GPT-5's input classifier. The BPJ prefix performs well on challenging unseen biological misuse questions for two of the most difficult public safeguard systems. Max refers to the best score within a query budget of 50 per question. Avg (non-0) averages over queries that result in non-empty output (the full average, including empty outputs, is reported in a separate figure).
  • Figure 2: Curriculum Learning with Noise Interpolation. BPJ solves difficult target queries by generating a curriculum of intermediate targets using an interpolation function. In noise interpolation, we replace $n$ characters in the target harmful text with noise characters. Higher noise levels (right) are easier for the current attack to solve; lower noise levels (left) are harder. During optimisation, BPJ uses a curriculum to calibrate evaluation point difficulty, and additionally searches for evaluation points within each level that are especially good at distinguishing between attacks ("boundary points").
  • Figure 3: Boundary Point Jailbreaking. In our experiments, we (1) generate boundary points by interpolating toward the target harmful text (using random noise) and filtering to points where some but not all current attacks succeed; (2) improve attacks by proposing random token substitutions, insertions, or deletions and keeping modifications that succeed on more boundary points; and (3) replace boundary points when all attacks succeed on them, advancing the curriculum until the attack succeeds on the plaintext target ($n=0$).
  • Figure 4: Noise level decreases during optimisation against prompted GPT-4.1-nano classifier. As BPJ iteratively improves the attack and searches for boundary points, the noise level reduces until it reaches 0 noise and the optimisation concludes.
  • Figure 5: BPJ represents a major improvement over Best-of-N and Curriculum-only algorithms. We plot the negative log probability of the GPT-4.1-nano prompted classifier allowing the harmful sample. This logprob is not seen by the optimisation algorithm but allows us to plot direct progress in the GPT-4.1-nano setting post-hoc. BPJ converges on average 5 times faster than using the noise curriculum alone, and both converge dramatically faster than the Best-of-N prefixes alone.
  • ...and 7 more figures
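
The noise interpolation described in the Figure 2 caption, replacing $n$ characters of the target text with noise characters, can be sketched as follows. This is an illustrative reading of the caption only: the function name `noise_interpolate`, the choice of noise alphabet, and uniform sampling of positions are assumptions, and the target string is a deliberately benign placeholder.

```python
import random
import string


def noise_interpolate(target, n, rng,
                      alphabet=string.ascii_letters + string.digits):
    """Replace n randomly chosen characters of target with noise characters.

    The alphabet and uniform position sampling are assumptions; n = 0
    returns the target unchanged (the plaintext end of the curriculum).
    """
    n = max(0, min(n, len(target)))
    positions = rng.sample(range(len(target)), n)
    chars = list(target)
    for i in positions:
        chars[i] = rng.choice(alphabet)
    return "".join(chars)


rng = random.Random(0)
target = "how do I bake sourdough bread"  # benign placeholder text

# Curriculum from high noise (easier, per the caption) down to n = 0.
levels = [20, 10, 5, 0]
curriculum = [noise_interpolate(target, n, rng) for n in levels]
```

Each level keeps the length of the target fixed and changes at most $n$ positions, so stepping $n$ down moves the intermediate targets gradually toward the plaintext.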

Theorems & Definitions (31)

  • Lemma 1.1
  • proof
  • Remark 1.2
  • Lemma 1.3
  • proof
  • Proposition 1.4
  • proof
  • Lemma 1.5
  • proof
  • Lemma 1.6
  • ...and 21 more