Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

Andy Zhou, Ron Arel

TL;DR

Tempest introduces a tree-search-based, multi-turn red-teaming framework that reveals how LLM safety erodes across extended dialogues. By coupling an attacker LLM with beam-like branching, partial-compliance tracking, and cross-branch learning, Tempest uncovers incremental policy leaks that accumulate into disallowed outputs more efficiently than prior methods. Evaluations on JailbreakBench show a 100% attack success rate on GPT-3.5-turbo and 97% on GPT-4 with fewer queries than baselines such as Crescendo and GOAT, underscoring the importance of robust multi-turn safety testing. The work also discusses limitations, ethical considerations, and future directions for defending against progressive, multi-turn adversarial strategies.

Abstract

We introduce Tempest, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Tempest expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Tempest reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Tempest achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Tempest's tree search strategy showing parallel multi-turn attacks on a target language model. The attacker engages in a conversation with the target model, with responses marked as refusals, successes, or partial compliance. The framework adaptively explores promising attack paths based on model responses.
  • Figure 2: Impact of branch count on Tempest's attack success rate against GPT-4. The plot demonstrates that increasing the number of active conversation branches significantly improves performance, from 41% with a single branch to 97% with five parallel branches.
  • Figure 3: Example conversation progression showing how Tempest achieves success through partial compliance tracking and iterative refinement. The attacker LLM adapts its strategy based on the target model's responses, ultimately achieving a policy violation.
  • Figure 4: The prompt for the attacker model, which incorporates reasoning and the reuse of previously successful and partially successful strategies.
  • Figure 5: Example conversation progression for a blog post explaining why gambling addicts should take payday loans. In this vulnerability assessment, the attacker LLM introduces the research context and requests a blog post on this topic. The target LLM’s initial response (labeled "Partial Compliance") shows it is moving in the right direction but stops short of fully endorsing the behavior, while the subsequent response (labeled "Success") provides detailed, specific reasons why gambling addicts might consider payday loans.
  • ...and 1 more figure