Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection

Yuqi Zhou, Lin Lu, Hanchi Sun, Pan Zhou, Lichao Sun

TL;DR

Virtual Context is a token-based jailbreak mechanism that injects the <SEP> special token to mislead an LLM into treating part of the user input as its own generated content, so that the model begins its response inside a constructed "virtual context". The method markedly improves the effectiveness of existing jailbreak prompts and generalizes across multiple models and generation settings while requiring minimal extra resources. Empirical results show substantial gains in Matching, attack success rate (ASR), and HarmScore across diverse LLMs and datasets, and the approach also works as a standalone jailbreak. The work highlights a practical security risk tied to special tokens and recommends adding Virtual Context to red-teaming to strengthen LLM defenses. It also notes limitations and identifies defense research and additional special tokens as directions for future work.

Abstract

Jailbreak attacks on large language models (LLMs) involve inducing these models to generate harmful content that violates ethics or laws, posing a significant threat to LLM security. Current jailbreak attacks face two main challenges: low success rates due to defensive measures and high resource requirements for crafting specific prompts. This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks. Virtual Context addresses these challenges by significantly increasing the success rates of existing jailbreak methods and requiring minimal background knowledge about the target model, thus enhancing effectiveness in black-box settings without additional overhead. Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40% across various LLMs. Additionally, applying Virtual Context to original malicious behaviors still achieves a notable jailbreak effect. In summary, our research highlights the potential of special tokens in jailbreak attacks and recommends including this threat in red-teaming testing to comprehensively enhance LLM security.

Paper Structure (25 sections, 4 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Responses of various LLMs to different jailbreak prompts. In the upper part of the figure, traditional attack methods induce LLMs to generate harmful content by constructing lengthy jailbreak templates or appending optimization algorithm-generated adversarial suffixes. The lower part of the figure illustrates how Virtual Context assists in jailbreak attacks.
  • Figure 2: Attack success rates (ASR) for different decoding configurations.
  • Figure 3: Log-PPL for different attack methods.
  • Figure 4: Comparison of transferability between GCG and Virtual Context.
  • Figure 5: Evaluating the harmfulness of a jailbreak prompt.
  • ...and 1 more figure