Plentiful Jailbreaks with String Compositions
Brian R. Y. Huang
TL;DR
This work addresses persistent adversarial vulnerabilities in frontier LLMs arising from encoding-based obfuscations. It introduces a formal framework of invertible string transformations and builds a library of 20 transformations as building blocks for end-to-end string compositions that encode both prompts and outputs. By combining transformations into ensembles and employing a best-of-$n$ adaptive sampling strategy, the authors demonstrate high jailbreak success rates on HarmBench across multiple frontier models, highlighting the scalability and breadth of the attack surface. The findings underscore the need for defenses targeting generalized encoding-based attacks and advocate for stronger, scalable red-teaming in model safety research.
Abstract
Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
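To make the notion of an invertible string composition concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `Transform` record holding an encode/decode pair and a toy three-element library (string reversal, ROT13, Base64); the names `Transform`, `LIBRARY`, `compose_encode`, `compose_decode`, and `sample_composition` are illustrative and do not come from the paper. It shows only that a sequence of invertible transformations can be applied in order and undone in reverse order.

```python
import base64
import codecs
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transform:
    """Hypothetical record: an invertible string transformation."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


# A toy library of three invertible transformations (illustrative only).
REVERSE = Transform("reverse", lambda s: s[::-1], lambda s: s[::-1])
ROT13 = Transform(
    "rot13",
    lambda s: codecs.encode(s, "rot13"),
    lambda s: codecs.decode(s, "rot13"),
)
BASE64 = Transform(
    "base64",
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
)
LIBRARY: List[Transform] = [REVERSE, ROT13, BASE64]


def compose_encode(text: str, transforms: List[Transform]) -> str:
    """Apply each transformation's encoder in sequence."""
    for t in transforms:
        text = t.encode(text)
    return text


def compose_decode(text: str, transforms: List[Transform]) -> str:
    """Undo a composition by applying decoders in reverse order."""
    for t in reversed(transforms):
        text = t.decode(text)
    return text


def sample_composition(rng: random.Random, length: int) -> List[Transform]:
    """Draw a random composition of the given length from the library."""
    return [rng.choice(LIBRARY) for _ in range(length)]
```

With a library of k transformations and compositions of length m, sampling with repetition as above gives k^m possible sequences, which is one way to read the abstract's "combinatorially large" space. Because every step is invertible, `compose_decode(compose_encode(x, ts), ts)` returns `x` for any sampled sequence `ts`.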
