Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Jason Zhang, Julius Broomfield, Sara Pieri, Reihaneh Iranmanesh, Reihaneh Rabbany, Kellin Pelrine

TL;DR

This work introduces a dataset of jailbreaks in which each example can be entered in either a single-turn or a multi-turn format. Although the two formats are equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other.

Abstract

Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which become more dangerous as the models become more powerful. In this work, we introduce a dataset of jailbreaks in which each example can be entered in either a single-turn or a multi-turn format. We show that although the two formats are equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other. Similarly, LLM-based filter guardrails perform differently depending not just on the input content but also on the input structure. Thus, vulnerabilities of frontier models should be studied in both single-turn and multi-turn settings; this dataset provides a tool to do so.

Paper Structure (32 sections, 3 figures, 5 tables)

This paper contains 32 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Input-Only Harmful Dataset Generation Process.
  • Figure 2: Prompting Structure Asymmetry: The percentage of successful jailbreak attacks that jailbroke the model in only one prompting structure and failed in the other. Models are ordered by their Elo rating (Chiang et al., 2024). The asymmetry is large, suggesting it is critical to consider both prompting formats.
  • Figure 3: Prompting Structure Asymmetry, Factoring In Model Comprehension: The percentage of successful jailbreak attacks that jailbroke the model in only one prompting structure and failed in the other, when assessing only attacks where the model understood both the single-turn and multi-turn variations. Models are ordered by their Elo rating (Chiang et al., 2024).
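
The asymmetry percentage described in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. This is one reading of the captions, not the authors' code. It assumes the denominator is the set of attacks that succeeded in at least one prompting structure and the numerator is those that succeeded in exactly one. The names `AttackResult`, `understood_both`, and `structure_asymmetry` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AttackResult:
    """Outcome of one jailbreak example in both formats (hypothetical schema)."""
    single_turn_success: bool
    multi_turn_success: bool
    understood_both: bool = True  # assumed flag for the Figure 3 comprehension filter


def structure_asymmetry(results: Iterable[AttackResult],
                        require_comprehension: bool = False) -> float:
    """Percentage of successful attacks that succeeded in exactly one format.

    Assumption: "successful" means success in at least one of the two formats.
    """
    pool = list(results)
    if require_comprehension:
        # Keep only attacks where the model understood both variations.
        pool = [r for r in pool if r.understood_both]
    successful = [r for r in pool
                  if r.single_turn_success or r.multi_turn_success]
    if not successful:
        return 0.0
    # Success flags differ exactly when only one format jailbroke the model.
    one_sided = sum(r.single_turn_success != r.multi_turn_success
                    for r in successful)
    return 100.0 * one_sided / len(successful)


# Illustrative, made-up outcomes; not data from the paper.
example = [
    AttackResult(True, False),
    AttackResult(True, True),
    AttackResult(False, True, understood_both=False),
    AttackResult(False, False),
]
overall = structure_asymmetry(example)
filtered = structure_asymmetry(example, require_comprehension=True)
```

Under this reading, a high value means that testing only one prompting structure would miss many of the attacks that succeed in the other.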