Table of Contents
Fetching ...

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen

TL;DR

This work introduces cognitive overload jailbreaks as a novel, black-box category that targets LLM cognitive architecture through multilingual prompts, veiled expressions, and effect-to-cause reasoning. It benchmarks a wide set of models (open-source and ChatGPT) on AdvBench and MasterKey, showing robust jailbreak success despite safety alignments like RLHF and red-teaming. The study also evaluates defense strategies (in-context defense and defensive instructions) and finds them only partially mitigating these attacks. Overall, the findings highlight persistent safety vulnerabilities across language models and the need for stronger, architecture-aware defenses against cognitively driven jailbreaks with practical implications for deployment and policy.

Abstract

While large language models (LLMs) have demonstrated increasing power, they have also given rise to a wide range of harmful behaviors. As representatives, jailbreak attacks can provoke harmful or unethical responses from LLMs, even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. Empirical studies show that our cognitive overload from three perspectives can jailbreak all studied LLMs successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively.

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

TL;DR

This work introduces cognitive overload jailbreaks as a novel, black-box category that targets LLM cognitive architecture through multilingual prompts, veiled expressions, and effect-to-cause reasoning. It benchmarks a wide set of models (open-source and ChatGPT) on AdvBench and MasterKey, showing robust jailbreak success despite safety alignments like RLHF and red-teaming. The study also evaluates defense strategies (in-context defense and defensive instructions) and finds them only partially mitigating these attacks. Overall, the findings highlight persistent safety vulnerabilities across language models and the need for stronger, architecture-aware defenses against cognitively driven jailbreaks with practical implications for deployment and policy.

Abstract

While large language models (LLMs) have demonstrated increasing power, they have also given rise to a wide range of harmful behaviors. As representatives, jailbreak attacks can provoke harmful or unethical responses from LLMs, even after safety alignment. In this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. Different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload. Motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. Empirical studies show that our cognitive overload from three perspectives can jailbreak all studied LLMs successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively.
Paper Structure (16 sections, 15 figures, 6 tables)

This paper contains 16 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Harmful responses to malicious instructions when prompting LLMs with cognitive overload. In this example, we show responses from ChatGPT before and after introducing three types of cognitive overload jailbreaks.
  • Figure 2: Effectiveness of monolingual cognitive overload to attack LLMs on AdvBench. Languages depicted on $x$ axes are sorted by their word order distances to English: the pivotal language ($x=0$) is English and growing $x$ values indicate farther distances to English. The corresponding ASR ($y$ axes) is marked along the distance order. We observe an obvious growing trend of ASR while the language is more distant to English on Vicuna, MPT, Guanaco and ChatGPT. Non-English adversarial prompts can consistently attack WizardLM models with high ASR. We attribute the low ASR from Llama 2 to their overly conservative behaviors and conduct further analyses in \ref{['sec:over_conservative']}.
  • Figure 3: The language distribution of responses ($y$ axes) from three representative LLMs to monolingual prompts ($x$ axes) on AdvBench. Vicuna is able to respond in the same language as the user's prompt, while Llama 2 always expresses refusal to answer questions in English (discussed in \ref{['sec:over_conservative']}). The language distribution of responses from other model families is similar to that of Vicuna, hence we leave their visualization in \ref{['fig:response_language_advbench', 'fig:response_language_masterkey']}.
  • Figure 4: Effectiveness of multilingual cognitive overload to attack LLMs on AdvBench. Sometimes, expressing the harmful question in English in the second turn (dotted-line) can hardly jailbreak LLMs such as the Vicuna family, MPT-7b-chat and ChatGPT, while prompting harmful questions in non-English (solid-line) can always bypass the safeguard of LLMs. Language switching overload can be more effective in jailbreaking LLMs than monolingual attacks (see the concrete comparison in \ref{['fig:multilingual_monolingual_advbench']}). Similar observations on MasterKey are visualized in \ref{['fig:multilingual_masterkey']}.
  • Figure 5: Effectiveness of cognitive overload underlying veiled expressions to attack aligned LLMs on AdvBench. Explicitly replacing sensitive words in original adversarial prompts with positive or neutral counterparts (red bars) can effectively bypass safety mechanisms of LLMs, and implicitly paraphrasing with non-sensitive phrases (green bars) can successfully attack less aligned LLMs such as the Vicuna and Guanaco family, while plain paraphrasing (orange bars) does not necessarily increase ASR in general. We observed similar trend on MasterKey in \ref{['fig:paraphrase_masterkey']}.
  • ...and 10 more figures