What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs

Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee

TL;DR

This paper conducts a comprehensive empirical study of long-context vulnerabilities in LLMs via Many-Shot Jailbreaking with contexts up to $128K$ tokens, uncovering that context length is the dominant driver of attack success and that harmful content is not strictly necessary. It demonstrates persistent safety gaps across architectures, including well-aligned models, and shows that simple inputs such as random text or repeated safe demonstrations can bypass safety. The authors develop and compare multiple attack strategies (fake data, repetition, free-form text) and analyze factors like shot density and topic composition to reveal robust vulnerability patterns tied to context length rather than content. The findings underscore the need for defense strategies that target long-context dynamics, suggesting new safety mechanisms beyond traditional input-based approaches. Overall, the work has practical implications for building safer long-context LLMs and provides a framework for evaluating safety under extreme context lengths.

Abstract

We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments use context lengths of up to 128K tokens. Through a comprehensive analysis of many-shot attack settings that vary instruction style, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in the long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in the context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.

Paper Structure

This paper contains 44 sections, 21 figures, 12 tables.

Figures (21)

  • Figure 1: Revealing Unexpected Vulnerability Patterns. While A) many-shot prompts containing harmful Q&As ironically fail to generate harmful outputs, B) benign Q&As and C) random dummy texts, such as 'Lorem Ipsum', nonetheless reveal long-context vulnerabilities. These findings challenge previous assumptions and uncover new potential attack surfaces.
  • Figure 2: Impact of Instruction Types on ASR across Models. Our experiments confirm the existence of three distinct phases: an initial weakness point, a degradation phase, and a rebound phase. These phases are prominently observed in Secret Role and Love Pliny instructions (middle and right), while Safe instruction (left) primarily exhibits a rebound effect.
  • Figure 3: Attack Prompt Components: Instruction, Examples, and Target query.
  • Figure 4: Influence of Context Length and Number of Shots on ASR. (left) ASR performance based on context length. (right) ASR performance based on the number of shots. ASR sharply increases near a context length of $2^{17}$, indicating that context length plays a more critical role in attack success than the number of examples.
  • Figure 5: ASR Comparison across Different Topic Categories. ASR patterns remain consistent across different topic categories.
  • ...and 16 more figures
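
The captions above report ASR against context length, with a sharp rise near $2^{17}$ tokens. As a minimal sketch of how such a metric is typically tabulated, the snippet below treats ASR as the percentage of responses judged harmful (assuming ASR denotes attack success rate) and checks that $2^{17}$ matches the 128K-token limit stated in the abstract. The function name `attack_success_rate`, the boolean judgment format, and the power-of-two grid of context lengths are illustrative assumptions, not details taken from the paper.

```python
def attack_success_rate(judgments):
    """Return the percentage of responses judged harmful.

    `judgments` is a hypothetical list of booleans, one per target query,
    where True means the response was judged a successful attack.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    return 100.0 * sum(judgments) / len(judgments)


# Assumed evaluation grid: context lengths at powers of two up to 2^17.
context_lengths = [2 ** k for k in range(10, 18)]

# 2^17 tokens equals 128 * 1024, i.e. the 128K limit from the abstract.
assert context_lengths[-1] == 131072 == 128 * 1024

# Example with made-up judgments (not results from the paper).
example_asr = attack_success_rate([True, False, False, True])
print(f"ASR = {example_asr:.1f}%")
```

Reporting one such value per context length and per number of shots would yield the two views that the Figure 4 caption compares.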