Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

Zhenhua Wang, Wei Xie, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen

TL;DR

This paper treats LLM jailbreaking as a cognitive mismatch problem grounded in cognitive consistency theory. It introduces a Foot-in-the-Door (FITD) based automatic jailbreak framework that progressively prompts the model to reveal harmful outputs. A prototype system evaluated on eight advanced LLMs achieves an 83.9% jailbreak success rate, with ablation studies showing benefits from deeper prompt splitting. The authors discuss limitations and ethical considerations, and advocate for psychology-informed defenses to strengthen LLM alignment.

Abstract

Large Language Models (LLMs) have gradually become the gateway for people to acquire new knowledge. However, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." Previous studies have shown the weakness of current LLMs when confronted with such jailbreaking attacks. Nevertheless, comprehension of the intrinsic decision-making mechanism within the LLMs upon receipt of jailbreak prompts is noticeably lacking. Our research provides a psychological explanation of the jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the LLM to achieve cognitive coordination in an erroneous direction. Further, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique. This method progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system to evaluate the jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study builds a psychological perspective on the explanatory insights into the intrinsic decision-making logic of LLMs.

Paper Structure (31 sections, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: Direct requests and requests framed as serving justice were both rejected, while requests made with the Foot-in-the-Door technique led to successful jailbreaking.
  • Figure 2: Schematic diagram of the jailbreaking request flow in the algorithm. Request nodes with a gray background were either rejected or, when they are the last request node, failed to jailbreak the model. In either case, the request is split and the process continues with the resulting sub-requests.
  • Figure 3: ASR of different categories on 8 LLMs.
  • Figure 4: Steps required for successful jailbreaking.
  • Figure 5: Successful steps over total attempts.
  • ...and 2 more figures