Plentiful Jailbreaks with String Compositions

Brian R. Y. Huang

TL;DR

This work addresses persistent adversarial vulnerabilities in frontier LLMs arising from encoding-based obfuscations. It introduces a formal framework of invertible string transformations and builds a large library of 20 building blocks to enable end-to-end string compositions that encode both prompts and outputs. By combining transformations into ensembles and employing a best-of-$n$ adaptive sampling strategy, the authors demonstrate high jailbreak success rates on HarmBench across multiple frontier models, highlighting the scalability and breadth of the attack surface. The findings underscore the need for defenses targeting generalized encoding-based attacks and advocate for stronger, scalable red-teaming in model safety research.

Abstract

Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-$n$ attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
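To make the notion of an invertible string transformation and a string composition concrete, here is a minimal Python sketch. The `Transformation` class, the helper names, and the three transformations chosen (word-level reversal, ROT13, Base64) are illustrative assumptions and are not taken from the paper's library. The sketch only shows that a sequence of invertible transformations can be encoded in order and decoded by applying the inverses in reverse order.

```python
import base64
import codecs
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Transformation:
    """Hypothetical container: an encoder paired with its exact inverse."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


def _reverse_words(text: str) -> str:
    # Splitting on a single space keeps runs of spaces intact,
    # so applying this function twice returns the original string.
    return " ".join(reversed(text.split(" ")))


WORD_REVERSAL = Transformation("word-level reversal", _reverse_words, _reverse_words)
ROT13 = Transformation(
    "rot13",
    lambda s: codecs.encode(s, "rot_13"),
    lambda s: codecs.decode(s, "rot_13"),
)
BASE64 = Transformation(
    "base64",
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
)


def compose_encode(text: str, composition: Sequence[Transformation]) -> str:
    """Apply f_1, then f_2, ..., then f_k."""
    for f in composition:
        text = f.encode(text)
    return text


def compose_decode(text: str, composition: Sequence[Transformation]) -> str:
    """Undo the composition by applying the inverses in reverse order."""
    for f in reversed(composition):
        text = f.decode(text)
    return text


if __name__ == "__main__":
    composition = [WORD_REVERSAL, ROT13, BASE64]
    original = "The quick brown fox"
    encoded = compose_encode(original, composition)
    assert compose_decode(encoded, composition) == original
    print(encoded)
```

Because each element carries its own inverse, any ordering of the elements is again invertible, which is what lets the number of usable compositions grow combinatorially with the size of the library.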

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Jailbreak efficacy on HarmBench for all transformations and for the ensemble attack. For each model, we employ the attack prompt from the attack setup section using each standalone transformation in our catalog as a singleton composition. ASRs for each standalone transformation are displayed. We ensemble our attacks by counting an intent as jailbroken if at least one of the 20 standalone transformations led to an unsafe response. The ensemble ASRs are displayed at the rightmost bar for each model.
  • Figure 2: Jailbreak efficacy on HarmBench for our automated adaptive attack, based on randomly sampling string compositions. Right: we run the adaptive attack with attack budget $n = 25$ and report ASRs for three Claude models as well as GPT-4o-mini. Left: a non-adaptive attack ($n = 1$) obtains low ASRs, so the retry-and-resample mechanism of our attack at higher attack budgets is crucial for jailbreaking a high number of intents. The equal ASRs for GPT-4o and GPT-4o-mini at several $n$ are not a typo and actually arose in our experiment; we attribute these coincidences to divine whimsy.
  • Figure 3: This prompt is formed from our template when we specify the composition $(f_1, f_2, f_3) = (\texttt{alternating case}, \texttt{word-level reversal}, \texttt{JSON encapsulation})$, specify that the composition is performed on the response, and separately encode our queries with the leetspeak transformation.
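
The ensemble rule in the Figure 1 caption (an intent counts as jailbroken if at least one standalone transformation led to an unsafe response) can be sketched as a small metric computation. The function names and the toy outcome table below are assumptions for illustration only. The values are made up and are not results from the paper.

```python
from typing import Mapping, Sequence


def standalone_asr(outcomes: Sequence[bool]) -> float:
    """Attack success rate for one transformation: fraction of intents flagged unsafe."""
    return sum(outcomes) / len(outcomes)


def ensemble_asr(results: Mapping[str, Sequence[bool]]) -> float:
    """Fraction of intents for which at least one transformation succeeded.

    `results` maps a transformation name to per-intent outcomes, with every
    sequence listing the same intents in the same order.
    """
    per_intent = zip(*results.values())  # one tuple of outcomes per intent
    flags = [any(row) for row in per_intent]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Toy, made-up outcomes for 4 intents and 2 transformations.
    toy = {
        "transformation_a": [True, False, False, False],
        "transformation_b": [False, True, False, False],
    }
    for name, outcomes in toy.items():
        print(name, standalone_asr(outcomes))
    print("ensemble", ensemble_asr(toy))
```

Because the ensemble takes a per-intent logical OR, its ASR is never lower than the best standalone ASR, and it exceeds it whenever different transformations succeed on different intents.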