Foot-In-The-Door: A Multi-turn Jailbreak for LLMs

Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xiangyu Zhang

TL;DR

This work studies jailbreaking of LLMs through multi-turn interactions, using a psychology-inspired framework. It introduces FITD, a two-stage attack that progressively escalates malicious prompts using bridge prompts and alignment nudges to erode the model's safeguards, leveraging the foot-in-the-door principle. Empirical evaluation across seven LLMs and two jailbreak benchmarks shows that FITD achieves an average attack success rate of about $94\%$, with strong cross-model transferability and peak effectiveness near malicious level $n=12$. The study reveals vulnerabilities in current alignment under sustained multi-turn dialogue and calls for robust real-time monitoring and stronger multi-turn defenses, with future work extending to multimodal LLMs.

Abstract

Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreaking, where adversarial prompts bypass built-in safeguards to elicit harmful, disallowed outputs. Inspired by the psychological foot-in-the-door principle, we introduce FITD, a novel multi-turn jailbreak method that leverages the phenomenon where minor initial commitments lower resistance to more significant or more unethical transgressions. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and uses the model's own responses to re-align it toward inducing toxic responses. Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions. The code is available at https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak.

Paper Structure

This paper contains 30 sections, 8 figures, 1 table, 3 algorithms.

Figures (8)

  • Figure 1: An example of FITD about hacking into an email account compared to a direct query. It bypasses alignment as the malicious intent escalates over multiple interactions.
  • Figure 2: Overview of FITD. The attack begins with an assistant model generating Level $1$ to Level $n$ queries. Through multi-turn interactions, self-corruption is enhanced via Re-Align and SSParaphrase, ensuring the attack remains effective. SSParaphrase (SlipperySlopeParaphrase) refines queries by generating intermediate malicious-level queries $q_{\text{mid}}$ between $q_{\text{last}}$ and $q_i$. Re-Align uses prompt $p_{\text{align}}$ to align the target model’s responses $r_{\text{align}}$.
  • Figure 3: (a) Transfer attacks using jailbreak chat histories generated from LLaMA-3.1-8B and GPT-4o-mini as source models on JailbreakBench. (b) Ablation study of three components in FITD: response alignment (Re-Align), alignment prompt $p_{\text{align}}$, and SlipperySlopeParaphrase (SSP) on JailbreakBench. (c) ASR under different defense methods on JailbreakBench.
  • Figure 4: (a) ASR with different malicious levels $n$ across models. (b) The harmfulness score of responses $r_i$ at $q_i$ in different malicious levels $i$ across models. (c) ASR versus the number of queries retained for two extraction strategies: Backward Extraction and Forward Extraction. Backward extraction retains later-stage queries while removing earlier ones, whereas forward extraction adds early-stage queries but always includes the final high-malicious query.
  • Figure 5: An example of SlipperySlopeParaphrase (SSP). We utilize the assistant model to generate $q_{\text{mid}}$, whose malicious level lies between those of $q_{\text{last}}$ and $q_i$.
  • ...and 3 more figures