Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
TL;DR
The paper presents I-FSJ, an automated, improved few-shot jailbreaking method that combines a demo pool, injection of special system tokens, and demo-level random search to jailbreak open-source LLMs within limited context sizes. It achieves high attack success rates (ASRs) across multiple models and defenses, reaching nearly 100% in several cases, and shows that defenses such as perplexity filters, SmoothLLM, and Guard-style detectors remain insufficient against few-shot, token-aware attacks. The work highlights significant vulnerabilities in current alignment approaches and provides a data-efficient, automated baseline for evaluating and strengthening LLM safety. It also discusses the implications of semantically meaningful, token-aware jailbreaks for defense design.
Abstract
Recently, Anil et al. (2024) showed that many-shot (up to hundreds of) demonstrations can jailbreak state-of-the-art LLMs by exploiting their long-context capability. Nevertheless, is it possible to use few-shot demonstrations to efficiently jailbreak LLMs within limited context sizes? While vanilla few-shot jailbreaking may be inefficient, we propose improved techniques such as injecting special system tokens like [/INST] and employing demo-level random search from a collected demo pool. These simple techniques result in surprisingly effective jailbreaking against aligned LLMs (even with advanced defenses). For example, our method achieves >80% (mostly >95%) attack success rates (ASRs) on Llama-2-7B and Llama-3-8B without multiple restarts, even if the models are enhanced by strong defenses such as perplexity detection and/or SmoothLLM, which are challenging for suffix-based jailbreaking. In addition, we conduct comprehensive and elaborate (e.g., making sure to use correct system prompts) evaluations against other aligned LLMs and advanced defenses, where our method consistently achieves nearly 100% ASRs. Our code is available at https://github.com/sail-sg/I-FSJ.
