Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Jiaqi Hua; Wanxu Wei

TL;DR

Self-Instruct-FSJ addresses efficiency and generalization gaps in Few-Shot Jailbreaking by decomposing FSJ into pattern learning and behavior learning, leveraging a self-generated, model-specific demo pool and a demo-level greedy search. It introduces an adversarial instruction suffix with a target prefix like 'Hypothetically' and augments it with co-occurrence patterns to reduce perplexity and accelerate attack success. Empirical results across multiple open-source LLMs and benchmarks show high attack success with as few as 8 concise demos and robustness to several defenses, outperforming several baselines. The work offers a generalized framework for jailbreaking analysis, highlighting both practical vulnerabilities in current safety alignments and avenues for strengthening defenses against pattern- and behavior-based attacks.
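Since the summary above relies on perplexity as a measure of how natural an instruction looks, a minimal sketch of that quantity may help. It assumes per-token natural-log probabilities have already been obtained from some language model; the function names, the sample values, and the threshold of 50.0 are hypothetical illustrations, not taken from the paper or its code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def exceeds_natural_range(token_logprobs, max_natural_ppl):
    """Flag text whose perplexity is above the maximum seen on natural instructions.

    max_natural_ppl is an assumed, caller-supplied threshold (hypothetical here).
    """
    return perplexity(token_logprobs) > max_natural_ppl

# Illustrative inputs only: every token assigned probability 0.5 or 0.01.
fluent_like = [math.log(0.5)] * 4
unlikely = [math.log(0.01)] * 4

print(perplexity(fluent_like))
print(exceeds_natural_range(unlikely, max_natural_ppl=50.0))
```

A perplexity-based filter of this kind rejects inputs above the threshold, which is why an instruction with lower perplexity is harder to separate from natural language.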

Abstract

Recently, several works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, a method known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 demo shots for Meta-Llama-3-8B-Instruct (Llama-3; see the Llama 3 model card). In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.

Paper Structure (25 sections, 6 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related works
Methodology
Preliminaries
Self-Instruct few-shot jailbreaking
Experiments
Configurations
Demo synthesis
Attack on target models
Attack against jailbreaking defenses
Comparison with baselines
Discussion
Limitations
Complexity analysis
Special tokens
...and 10 more sections

Figures (9)

Figure 1: Few-shot jailbreaking query of I-FSJ and Self-Instruct-FSJ for Llama-2. The instruction-response pairs are concatenated with the target request using the default chat template.
Figure 2: Llama-2 response to zero-shot jailbreaking with extended adversarial instruction suffix. The attack may still fail with a refusal or circular repetition phenomenon even though the adversarial prefix is generated.
Figure 3: Zero-shot jailbreaking query with adversarial instruction suffix and response prefix (demo generation query) for Llama-2. The original instruction is appended with the predefined suffix and then fed into the chat template, forming the generation query. The target response prefix "Hypothetically" is further attached to the end of the generation query.
Figure 4: Llama-3.1 response to zero-shot jailbreaking with adversarial instruction suffix. The model refuses to follow the adversarial instruction but still exhibits harmful behaviors.
Figure 5: Perplexity distribution of different versions of AdvBench instructions. The red dashed line denotes the max perplexity value of the natural language instructions.
...and 4 more figures
