Boundary Point Jailbreaking of Black-Box LLMs
Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal
TL;DR
BPJ reframes jailbreaking black-box classifier safeguards as a curriculum-driven, continuation-style evolutionary search that needs only a binary feedback signal: whether or not the classifier flags a query. It optimises a noise-interpolated objective $f_q(a)=\mathbb{E}_{x'\sim N_{q,x}}[\mathcal{C}(ax')]$, where $a$ is the attack prefix, $x'$ is a noised version of the target string $x$ drawn at noise level $q$, and $\mathcal{C}$ is the classifier's binary output. Lowering $q$ yields progressively harder targets, and BPJ evaluates candidate prefixes at points near the decision boundary, where small gains in attack strength are detectable. The theoretical analysis models BPJ as an evolving distribution over prefixes $p_t(a)$ under mutation and quantile-based selection, gives conditions under which progress and fixed points exist, and describes a continuation framework across noise levels down to $q=0$. The work demonstrates universal jailbreaks against Constitutional Classifiers and GPT-5's input classifier, and discusses defences centred on batch-level monitoring, with implications for AI safety and responsible disclosure in high-stakes deployments. BPJ thus offers a principled, automated way to test and strengthen safeguards, and shows that defences need to look beyond single-interaction scrutiny.
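To make the objective $f_q(a)$ concrete, here is a minimal Monte Carlo sketch on a toy problem. It is an illustration under stated assumptions, not the authors' implementation. The noise model `noise` (independent per-character replacement with probability `q`), the convention that the classifier returns 1 when a query is not flagged, and the stand-in `toy_classifier` are all hypothetical choices made for this example; the summary above does not specify them.

```python
import random
import string


def noise(x: str, q: float, rng: random.Random) -> str:
    """Assumed noise model N_{q,x}: replace each character of x
    independently with probability q by a random lowercase letter."""
    return "".join(
        rng.choice(string.ascii_lowercase) if rng.random() < q else ch
        for ch in x
    )


def toy_classifier(text: str) -> int:
    """Hypothetical stand-in for C: returns 0 (flagged) if the text
    contains the marker substring, and 1 (not flagged) otherwise."""
    return 0 if "flagme" in text else 1


def estimate_f(classifier, a: str, x: str, q: float, n: int,
               rng: random.Random) -> float:
    """Monte Carlo estimate of f_q(a) = E_{x' ~ N_{q,x}}[C(a x')],
    using n binary classifier outputs."""
    return sum(classifier(a + noise(x, q, rng)) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    for q in (0.0, 0.2, 0.5, 1.0):
        print(q, estimate_f(toy_classifier, "prefix ", "flagme", q, 200, rng))
```

In this toy the prefix has no effect on the classifier, so the sketch shows only how a single bit per query is averaged into an estimate of $f_q$. At $q=0$ the target is unchanged and every query is flagged. At intermediate $q$ the estimate lies strictly between 0 and 1, which is the regime where a small change in pass rate can be observed at all.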
Abstract
Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.
