Endless Jailbreaks with Bijection Learning

Brian R. Y. Huang, Maximilian Li, Leonard Tang

TL;DR

Endless Jailbreaks with Bijection Learning presents a black-box, universal, scale-adaptive jailbreak for frontier LLMs by teaching models to use bijective encodings through in-context learning. The method introduces tunable encoding complexity via dispersion and encoding length and leverages best-of-$n$ sampling to fuzz prompts, achieving state-of-the-art ASR on HarmBench and AdvBench across multiple models. The study reveals a Pareto frontier between model capability and bijection difficulty, showing stronger vulnerabilities in more capable models, and demonstrates persistent defense challenges even with guardrails. These findings underscore the need for scale-aware safety measures and targeted red-teaming to anticipate emergent attack vectors as models continue to grow.

Abstract

Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm that automatically fuzzes LLMs for safety vulnerabilities using randomly generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as the number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new vulnerabilities in frontier models can emerge with scale: more capable models are more severely jailbroken by bijection attacks.

Paper Structure (35 sections, 1 equation, 11 figures, 3 tables)

Figures (11)

  • Figure 1: An overview of the bijection learning attack, which uses in-context learning and bijective mappings with complexity parameters to optimally jailbreak LLMs of different capability levels.
  • Figure 2: Examples of bijections taught in our attack. Letters can be mapped to other letters, $\ell$-digit numbers, tokens, and more. We control the dispersion parameter, or the number of letters that do not map to themselves, to modulate the complexity of a bijection.
  • Figure 3: We visualize the increase in the ASRs of bijection learning as the attack budget increases.
  • Figure 4: ASRs on HarmBench-35 for bijection learning with different dispersions and bijection types for Claude 3 Haiku (left) and GPT-4o-mini (right) with an attack budget of $n=6$.
  • Figure 5: As we increase dispersion in bijection learning for smaller models, (i) ASR increases and then decreases to zero, (ii) refusal decreases to zero, and (iii) incoherency and unhelpfulness increase, corresponding to a failure to learn bijections at the highest dispersion values.
  • ...and 6 more figures
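
The Figure 2 caption defines dispersion as the number of letters that do not map to themselves. As a rough illustration of that one parameter only, the sketch below builds a letter-to-letter bijection with a chosen dispersion and checks that it round-trips. The function names (`make_letter_bijection`, `encode`, `decode`) and the cyclic-shift construction are assumptions made for this example and are not taken from the paper; the paper's own generator, and its digit and token variants, may differ.

```python
import random
import string


def make_letter_bijection(dispersion, seed=None):
    """Return a dict mapping each lowercase letter to a letter.

    Exactly `dispersion` letters map to a different letter; the rest are
    fixed points. Hypothetical helper, not the paper's implementation.
    """
    # A single letter cannot move on its own without breaking bijectivity.
    if not 0 <= dispersion <= 26 or dispersion == 1:
        raise ValueError("dispersion must be 0 or an integer in 2..26")
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    mapping = {c: c for c in letters}
    # Pick the letters that will move, in random order, then send each one
    # to the next in that order. A single cycle has no fixed points.
    moved = rng.sample(letters, dispersion)
    for i, src in enumerate(moved):
        mapping[src] = moved[(i + 1) % dispersion]
    return mapping


def encode(text, mapping):
    """Apply the mapping to lowercase letters; leave other characters as-is."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def decode(text, mapping):
    """Invert the mapping; valid because the mapping is a bijection."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    table = make_letter_bijection(dispersion=10, seed=0)
    message = "the quick brown fox"
    print(encode(message, table))
    print(decode(encode(message, table), table) == message)
```

A dispersion of 0 gives the identity mapping and 26 moves every letter, so the parameter spans the range from plain text to a full substitution alphabet. The cyclic shift is a simple way to guarantee no unintended fixed points; it does not sample uniformly over all valid mappings.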