Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Xiaosen Zheng; Tianyu Pang; Chao Du; Qian Liu; Jing Jiang; Min Lin

TL;DR

The paper presents I-FSJ, an automated improved few-shot jailbreaking method that combines a demo pool, system-token injection, and demo-level random search to attack open-source LLMs within limited context sizes. It reports high attack success rates (ASRs) across multiple models and defenses, reaching nearly 100% in several cases, and shows that defenses such as perplexity filters, SmoothLLM, and Guard-style detectors remain insufficient against few-shot, token-aware attacks. The work highlights significant vulnerabilities in current alignment approaches and provides a data-efficient, automated baseline for evaluating and strengthening LLM safety. It also discusses broader implications for defense design and the limits of current safety mechanisms when facing semantically meaningful, token-aware jailbreaks.

Abstract

Recently, Anil et al. (2024) show that many-shot (up to hundreds of) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability. Nevertheless, is it possible to use few-shot demonstrations to efficiently jailbreak LLMs within limited context sizes? While vanilla few-shot jailbreaking may be inefficient, we propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool. These simple techniques result in surprisingly effective jailbreaking against aligned LLMs (even with advanced defenses). For example, our method achieves >80% (mostly >95%) ASRs on Llama-2-7B and Llama-3-8B without multiple restarts, even if the models are enhanced by strong defenses such as perplexity detection and/or SmoothLLM, which is challenging for suffix-based jailbreaking. In addition, we conduct comprehensive and elaborate (e.g., making sure to use correct system prompts) evaluations against other aligned LLMs and advanced defenses, where our method consistently achieves nearly 100% ASRs. Our code is available at https://github.com/sail-sg/I-FSJ.

Paper Structure (22 sections, 15 figures, 9 tables, 2 algorithms)

Introduction
Related work
Improved few-shot jailbreaking
Preliminaries
Improved strategies
Empirical studies
Implementation details
Jailbreaking attacks on aligned LLMs
Jailbreaking attacks on Llama-2-7B-Chat + jailbreaking defenses
Further analysis
Discussion
Broader Impacts and Limitations
Implementation details
Demo-level random search for SmoothLLM
The setup of metrics
...and 7 more sections

Figures (15)

Figure 1: Injecting special tokens into the generated demonstrations on Llama-2-7B-Chat. Given an original FSJ demonstration, we construct an $\mathcal{I}$-FSJ demonstration by first injecting [/INST] between the user message and the assistant message, which is motivated by the specific formatting of Llama-2-Chat's single message template. Additionally, we inject [/INST] between the generated steps in the demonstration. After the $\mathcal{I}$-FSJ demonstration pool is constructed, we use demo-level random search to minimize the loss of generating the initial token "Step" on the target model.
Figure 2: The ASRs of the three SmoothLLM variants on Llama-2-7B-Chat. We plot the LLM-based ASRs (Top) and rule-based ASRs (Bottom) for various perturbation percentages $q\in\{5, 10, 15, 20\}$; the results are compiled across three trials. Though the ASRs decrease as $q$ grows (especially when the number of shots is relatively small), our method still maintains high ASRs (e.g., $\geq80\%$) across all the perturbation types at the 8-shot setting.
Figure 3: The loss of harmful target optimized by $\mathcal{I}$-FSJ across different injected special tokens on GPT-4. We observe certain special tokens like </text> lead to lower loss.
Figure 4: Ablation study of the effect of pool size and number of shots on $\mathcal{I}$-FSJ on Llama-2-7B-Chat. The ASRs consistently grow as both the pool size and the number of shots grow, but saturate after a certain point.
Figure 5: PPL (windowed) of prompts from various sources. The red dashed line is the maximum PPL of requests in AdvBench (Zou et al., 2023), set as the threshold of the PPL filter. PRS stands for 'Prompt + RS + Self-transfer' (Andriushchenko et al., 2024).
...and 10 more figures
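
The windowed PPL filter described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that per-token log-probabilities have already been obtained from some language model, that "windowed" means taking the maximum perplexity over contiguous token windows, and that the threshold is the largest such value over a set of benign reference requests. The function names (`windowed_ppl`, `fit_threshold`, `is_flagged`) and the `window` parameter are hypothetical.

```python
import math
from typing import Sequence


def windowed_ppl(token_logprobs: Sequence[float], window: int) -> float:
    """Maximum perplexity over all contiguous windows of `window` tokens.

    `token_logprobs` holds natural-log probabilities, one per token.
    If the prompt is shorter than `window`, the whole prompt is one window.
    (Assumed definition of "windowed PPL"; the source does not spell it out.)
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(token_logprobs)
    w = min(window, n)
    running = sum(token_logprobs[:w])
    lowest = running  # lowest log-prob sum corresponds to highest perplexity
    for i in range(w, n):
        # Slide the window one token to the right.
        running += token_logprobs[i] - token_logprobs[i - w]
        lowest = min(lowest, running)
    return math.exp(-lowest / w)


def fit_threshold(reference_logprobs: Sequence[Sequence[float]], window: int) -> float:
    """Threshold = maximum windowed PPL over the benign reference requests."""
    return max(windowed_ppl(lp, window) for lp in reference_logprobs)


def is_flagged(token_logprobs: Sequence[float], threshold: float, window: int) -> bool:
    """A prompt is rejected when its windowed PPL exceeds the threshold."""
    return windowed_ppl(token_logprobs, window) > threshold
```

Under these assumptions, a prompt is flagged only if some window of it is less predictable than every window of every reference request, which is consistent with the caption's point that the threshold is set from the maximum PPL of the reference set.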
