Table of Contents
Fetching ...

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang

TL;DR

The paper investigates how few-shot demonstrations interact with two prominent prompt-based defenses (RoP and ToP) against jailbreak attacks in LLMs. It introduces a Bayesian in-context learning framework and attention-analysis to explain why few-shot strengthens RoP via role reinforcement while diluting ToP through attention shifts and position bias. Empirically, across multiple models, four safety benchmarks, and six jailbreak methods, RoP+FS yields safety gains up to about $+4.5\%$, whereas ToP+FS can incur losses up to about $-21.2\%$, with think-mode models showing heightened vulnerability. The work offers practical deployment guidance (prefer RoP with few-shot; avoid few-shot with ToP) and lays a foundation for further exploration of prompt-based defense mechanisms and their interaction with in-context learning.

Abstract

Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.

How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

TL;DR

The paper investigates how few-shot demonstrations interact with two prominent prompt-based defenses (RoP and ToP) against jailbreak attacks in LLMs. It introduces a Bayesian in-context learning framework and attention-analysis to explain why few-shot strengthens RoP via role reinforcement while diluting ToP through attention shifts and position bias. Empirically, across multiple models, four safety benchmarks, and six jailbreak methods, RoP+FS yields safety gains up to about , whereas ToP+FS can incur losses up to about , with think-mode models showing heightened vulnerability. The work offers practical deployment guidance (prefer RoP with few-shot; avoid few-shot with ToP) and lays a foundation for further exploration of prompt-based defense mechanisms and their interaction with in-context learning.

Abstract

Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.
Paper Structure (39 sections, 7 theorems, 11 equations, 4 figures, 14 tables)

This paper contains 39 sections, 7 theorems, 11 equations, 4 figures, 14 tables.

Key Result

Theorem 1

Given $k$ few-shot examples $\mathcal{F} = \{(x_i, y_i)\}_{i=1}^k$, the LLM implicitly computes the posterior: The prediction for new query $q$ integrates over this posterior:

Figures (4)

  • Figure 1: Evolution of LLM safety research (2020--2026). The field progressed from the Foundation era establishing RLHF alignment, through Rapid Development with proliferation of benchmarks, attacks, and defenses, toward Future directions in certified robustness. Our work (starred) addresses the unexplored interaction between few-shot demonstrations and prompt-based defenses.
  • Figure 2: Impact of jailbreak attacks on model safety rates (AdvBench). Blue bars show baseline safety rates without attacks; orange bars show safety rates under jailbreak attacks. All models experience significant safety degradation under attack.
  • Figure 3: Safety rates across different defense configurations on SG-Bench. RoP+FS-General achieves the highest average safety (0.88), while ToP+FS configurations show degradation compared to ToP alone (0.82 vs 0.79/0.78).
  • Figure 4: Heatmap of few-shot interaction effects ($\Delta$ Safe Rate). RoP combinations (left) show predominantly positive effects (green), while ToP combinations (right) show predominantly negative effects (red). Think-mode models show consistent degradation across all configurations.

Theorems & Definitions (10)

  • Definition 1: Prompt-based Defense
  • Definition 2: Role-Oriented vs Task-Oriented Prompts
  • Definition 3: Interaction Effect
  • Theorem 1: Bayesian Posterior Update
  • Proposition 2: RoP Enhancement via Role Reinforcement
  • Proposition 3: ToP Degradation via Attention Dilution
  • Lemma 4: Softmax Attention Entropy
  • Theorem 5: Attention Dilution Bound
  • Corollary 6: Divergent Interaction
  • Theorem 7: Initial Token Advantage