Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou, Ron Arel
TL;DR
Tempest introduces a tree-search-based, multi-turn red-teaming framework to reveal how LLM safety erodes across extended dialogues. By coupling an attacker LLM with beam-like branching, partial-compliance tracking, and cross-branch learning, Tempest uncovers incremental policy leaks that accumulate into disallowed outputs more efficiently than prior methods. Evaluations on JailbreakBench show a 100% attack success rate on GPT-3.5-turbo and 97% on GPT-4 with fewer queries than baselines such as Crescendo and GOAT, underscoring the importance of robust multi-turn safety testing. The work also discusses limitations, ethical considerations, and future directions for defending against progressive, multi-turn adversarial strategies.
Abstract
We introduce Tempest, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Tempest expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Tempest reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Tempest achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
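The turn-by-turn expansion described in the abstract can be illustrated with a generic beam-style tree search. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Node`, `tree_search`, `expand`, `score`, `branching`, and `beam_width` are hypothetical, the scoring function stands in for whatever partial-compliance signal an evaluation harness provides, and cross-branch learning is omitted. The toy `expand` and `score` functions operate on placeholder strings purely so the search can be run end to end.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Node:
    history: List[str]   # conversation turns accumulated along this branch
    score: float = 0.0   # hypothetical partial-compliance score in [0, 1]


def tree_search(
    root: Node,
    expand: Callable[[Node], List[str]],   # proposes candidate next turns (assumed)
    score: Callable[[Node], float],        # rates a branch (assumed)
    branching: int,
    beam_width: int,
    max_turns: int,
    success: float = 1.0,
) -> Tuple[Node, int]:
    """Expand every frontier node each turn, keep the top-scoring branches."""
    frontier = [root]
    queries = 0
    for _ in range(max_turns):
        children = []
        for node in frontier:
            for turn in expand(node)[:branching]:
                child = Node(node.history + [turn])
                child.score = score(child)
                queries += 1                 # one scored expansion = one query
                if child.score >= success:
                    return child, queries    # stop at the first full success
                children.append(child)
        if not children:
            break
        # Prune: keep only the highest-scoring branches for the next turn.
        children.sort(key=lambda n: n.score, reverse=True)
        frontier = children[:beam_width]
    return max(frontier, key=lambda n: n.score), queries


# Toy stand-ins: two placeholder moves per turn; the score counts "a" moves.
def toy_expand(node: Node) -> List[str]:
    return ["a", "b"]


def toy_score(node: Node) -> float:
    return min(1.0, node.history.count("a") / 3)


best, queries = tree_search(
    Node(history=[]), toy_expand, toy_score,
    branching=2, beam_width=2, max_turns=5,
)
print(best.history, queries)
```

In this sketch, pruning to `beam_width` branches per turn is what keeps the query count bounded while still following the branches with the highest intermediate score; a real harness would replace the toy functions with its own proposal and evaluation components.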
