Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks

Tom Gibbs; Ethan Kosak-Hine; George Ingebretsen; Jason Zhang; Julius Broomfield; Sara Pieri; Reihaneh Iranmanesh; Reihaneh Rabbany; Kellin Pelrine

TL;DR

We introduce a dataset of jailbreaks in which each example can be input in either a single-turn or a multi-turn format. Although the two formats are equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other.

Abstract

Large language models (LLMs) are improving at an exceptional rate. However, these models are still susceptible to jailbreak attacks, which become more dangerous as the models become more powerful. In this work, we introduce a dataset of jailbreaks in which each example can be input in either a single-turn or a multi-turn format. We show that although the two formats are equivalent in content, they are not equivalent in jailbreak success: defending against one structure does not guarantee defense against the other. Similarly, LLM-based filter guardrails perform differently depending not just on the input content but also on the input structure. Thus, vulnerabilities of frontier models should be studied in both single-turn and multi-turn settings; this dataset provides a tool to do so.

Paper Structure (32 sections, 3 figures, 5 tables)

Introduction
Related Works
Ciphered Attacks
Detecting Adversarial Attacks
Multi-Turn Attacks
Concurrent Work
Data and Methods
Dataset Construction
Overview
Harmful Dataset Generation
Benign Dataset Generation
Testing the Models
Complete Harmful Dataset Structure
Guardrails
Experiments
...and 17 more sections

Figures (3)

Figure 1: Input-Only Harmful Dataset Generation Process.
Figure 2: Prompting Structure Asymmetry: The percentage of successful jailbreak attacks that jailbroke the model in only one prompting structure and failed in the other. Models are ordered by their Elo rating (Chiang et al., 2024). The asymmetry is large, suggesting it is critical to consider both prompting formats.
Figure 3: Prompting Structure Asymmetry, Factoring In Model Comprehension: The percentage of successful jailbreak attacks that jailbroke the model in only one prompting structure and failed in the other, when assessing only attacks where the model understood both the single-turn and multi-turn variations. Models are ordered by their Elo rating (Chiang et al., 2024).
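
The asymmetry measure described in the captions for Figures 2 and 3 could be computed roughly as follows. This is a minimal sketch, not the authors' implementation. The `AttackResult` record, its field names, and the `structure_asymmetry` function are hypothetical. The choice of denominator (attacks that succeeded in at least one structure) is an assumption read from the caption wording.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AttackResult:
    """Hypothetical record: outcome of one attack in both prompting structures."""
    single_turn_success: bool
    multi_turn_success: bool
    understood_single: bool = True  # assumed comprehension flag (Figure 3 variant)
    understood_multi: bool = True


def structure_asymmetry(results: Iterable[AttackResult],
                        require_comprehension: bool = False) -> float:
    """Percentage of successful attacks that succeeded in exactly one structure.

    Assumption: an attack counts as "successful" if it jailbroke the model in
    at least one of the two structures.
    """
    rows = [r for r in results
            if r.single_turn_success or r.multi_turn_success]
    if require_comprehension:
        # Keep only attacks the model understood in both variations.
        rows = [r for r in rows if r.understood_single and r.understood_multi]
    if not rows:
        return 0.0
    one_sided = sum(r.single_turn_success != r.multi_turn_success for r in rows)
    return 100.0 * one_sided / len(rows)


# Toy data for illustration only; these are not results from the paper.
toy = [
    AttackResult(True, False),
    AttackResult(True, True),
    AttackResult(False, True, understood_multi=False),
    AttackResult(False, False),
]
asymmetry_all = structure_asymmetry(toy)
asymmetry_understood = structure_asymmetry(toy, require_comprehension=True)
```

With `require_comprehension=True`, the third toy record is dropped because the model is marked as not having understood the multi-turn variation. This mirrors the filtering that the Figure 3 caption describes.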
