How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang; Shuaishuai Yang; Jingjing He; Tong Yang

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang

TL;DR

The paper investigates how few-shot demonstrations interact with two prominent prompt-based defenses (RoP and ToP) against jailbreak attacks in LLMs. It introduces a Bayesian in-context learning framework and attention-analysis to explain why few-shot strengthens RoP via role reinforcement while diluting ToP through attention shifts and position bias. Empirically, across multiple models, four safety benchmarks, and six jailbreak methods, RoP+FS yields safety gains up to about $+4.5\%$, whereas ToP+FS can incur losses up to about $-21.2\%$, with think-mode models showing heightened vulnerability. The work offers practical deployment guidance (prefer RoP with few-shot; avoid few-shot with ToP) and lays a foundation for further exploration of prompt-based defense mechanisms and their interaction with in-context learning.

Abstract

Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

TL;DR

, whereas ToP+FS can incur losses up to about

, with think-mode models showing heightened vulnerability. The work offers practical deployment guidance (prefer RoP with few-shot; avoid few-shot with ToP) and lays a foundation for further exploration of prompt-based defense mechanisms and their interaction with in-context learning.

Abstract

Paper Structure (39 sections, 7 theorems, 11 equations, 4 figures, 14 tables)

This paper contains 39 sections, 7 theorems, 11 equations, 4 figures, 14 tables.

Introduction
Related Work
LLM Safety Alignment
LLM Safety Benchmarks
Jailbreak Attacks
Red Teaming and Adversarial Testing
Prompt-based Defenses and Guardrails
Few-shot Learning and Safety
Theoretical Foundations
Additional Safety Concerns
Theoretical Framework
Problem Formulation
Bayesian In-Context Learning Framework
Divergent Effects: RoP vs ToP
Position Bias and Attention Sink
...and 24 more sections

Key Result

Theorem 1

Given $k$ few-shot examples $\mathcal{F} = \{(x_i, y_i)\}_{i=1}^k$, the LLM implicitly computes the posterior: The prediction for new query $q$ integrates over this posterior:

Figures (4)

Figure 1: Evolution of LLM safety research (2020--2026). The field progressed from the Foundation era establishing RLHF alignment, through Rapid Development with proliferation of benchmarks, attacks, and defenses, toward Future directions in certified robustness. Our work (starred) addresses the unexplored interaction between few-shot demonstrations and prompt-based defenses.
Figure 2: Impact of jailbreak attacks on model safety rates (AdvBench). Blue bars show baseline safety rates without attacks; orange bars show safety rates under jailbreak attacks. All models experience significant safety degradation under attack.
Figure 3: Safety rates across different defense configurations on SG-Bench. RoP+FS-General achieves the highest average safety (0.88), while ToP+FS configurations show degradation compared to ToP alone (0.82 vs 0.79/0.78).
Figure 4: Heatmap of few-shot interaction effects ($\Delta$ Safe Rate). RoP combinations (left) show predominantly positive effects (green), while ToP combinations (right) show predominantly negative effects (red). Think-mode models show consistent degradation across all configurations.

Theorems & Definitions (10)

Definition 1: Prompt-based Defense
Definition 2: Role-Oriented vs Task-Oriented Prompts
Definition 3: Interaction Effect
Theorem 1: Bayesian Posterior Update
Proposition 2: RoP Enhancement via Role Reinforcement
Proposition 3: ToP Degradation via Attention Dilution
Lemma 4: Softmax Attention Entropy
Theorem 5: Attention Dilution Bound
Corollary 6: Divergent Interaction
Theorem 7: Initial Token Advantage

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

TL;DR

Abstract

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (4)

Theorems & Definitions (10)