Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
Jiaqi Hua, Wanxu Wei
TL;DR
Self-Instruct-FSJ addresses efficiency and generalization gaps in Few-Shot Jailbreaking (FSJ) by decomposing the attack into pattern learning and behavior learning, leveraging a self-generated, model-specific demo pool and a demo-level greedy search. It introduces an adversarial instruction suffix with a target prefix such as 'Hypothetically' and augments it with co-occurrence patterns to reduce perplexity and accelerate attack success. Empirical results across multiple open-source LLMs and benchmarks show high attack success with as few as 8 concise demos, robustness to several defenses, and stronger performance than several baselines. The work offers a generalized framework for jailbreaking analysis, highlighting both practical vulnerabilities in current safety alignments and avenues for strengthening defenses against pattern- and behavior-based attacks.
Abstract
Recently, several works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, an approach known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3) \cite{llama3modelcard}. In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.
