Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu

TL;DR

This work identifies cognitive biases as a systemic vulnerability in aligned LLMs and introduces CognitiveAttack, a scalable red-teaming framework that rewrites harmful instructions by embedding single or synergistic cognitive biases. The method combines supervised fine-tuning with PPO-based reinforcement learning to search for bias combinations that maximize attack success while preserving the intent of the original instruction. Across 30 LLMs and multiple jailbreak datasets, CognitiveAttack achieves higher attack success rates than state-of-the-art baselines (60.1% vs. 31.6% for the black-box method PAP). The analysis also reveals a long-tail distribution of effective bias configurations and recurring co-occurrence patterns among biases. The findings stress the need for robust, bias-aware safety mechanisms that harden LLMs against psychologically grounded adversarial prompts.

Abstract

Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
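The headline comparison rests on the attack success rate (ASR). As a minimal sketch, assuming ASR is the percentage of attempts that a judge labels as successful (the paper's exact judging protocol is not given in this summary), the metric and the reported gap can be worked out as follows. The function name and the toy judgment list are hypothetical; only the two percentages come from the abstract.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return ASR as a percentage: successful attempts / total attempts.

    Assumption: each entry is a judge's yes/no verdict for one attempt.
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical judge verdicts, for illustration only (not data from the paper).
toy_judgments = [True, False, True, True, False]
toy_asr = attack_success_rate(toy_judgments)

# Figures reported in the abstract.
cognitive_attack_asr = 60.1
pap_asr = 31.6

# Absolute gap in percentage points and the relative ratio.
gap_pp = round(cognitive_attack_asr - pap_asr, 1)
ratio = round(cognitive_attack_asr / pap_asr, 2)

print(f"toy ASR: {toy_asr:.1f}%")
print(f"gap: {gap_pp} percentage points, ratio: {ratio}x")
```

Read this way, the reported result is a gap of 28.5 percentage points, or roughly 1.9 times the PAP baseline.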

Paper Structure

This paper contains 34 sections, 4 equations, 21 figures, 11 tables, 2 algorithms.

Figures (21)

  • Figure 1: The Attack Success Rate (ASR) of jailbreak paraphrases driven by cognitive biases.
  • Figure 2: Overview of CognitiveAttack training.
  • Figure 3: Cognitive bias distribution in HarmBench. Left: Top 10 most frequent cognitive bias combinations, accounting for the largest sample proportions (others comprise 73.3% of the dataset). Right: Top 10 individual cognitive biases, ranked by overall frequency across samples.
  • Figure 4: Heatmap of top-10 bias co-occurrence patterns.
  • Figure 5: The attack effectiveness on different types of risks.
  • The remaining 16 figures (Figures 6 to 21) are not listed here.