Cognitive Overload Attack: Prompt Injection for Long Context

Bibek Upadhayay, Vahid Behzadan, Amin Karbasi

TL;DR

This paper interprets ICL in LLMs through the lens of cognitive neuroscience, drawing parallels between learning in human cognition and ICL, and proposes integrating insights from cognitive load theory into the design and evaluation of LLMs to better anticipate and mitigate the risks of adversarial attacks.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks across various domains without needing explicit retraining. This capability, known as In-Context Learning (ICL), while impressive, exposes LLMs to a variety of adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output. In this paper, we propose a novel interpretation of ICL in LLMs through the lens of cognitive neuroscience, drawing parallels between learning in human cognition and ICL. We apply the principles of Cognitive Load Theory to LLMs and empirically validate that, similar to human cognition, LLMs also suffer from cognitive overload: a state in which the demand on cognitive processing exceeds the available capacity of the model, leading to potential errors. Furthermore, we demonstrate how an attacker can exploit ICL to jailbreak LLMs through deliberately designed prompts that induce cognitive overload, thereby compromising the models' safety mechanisms. We empirically validate this threat model by crafting various cognitive overload prompts and show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 Opus, Llama-3-70B-Instruct, Gemini-1.0-Pro, and Gemini-1.5-Pro can be successfully jailbroken, with attack success rates of up to 99.99%. Our findings highlight critical vulnerabilities in LLMs and underscore the urgency of developing robust safeguards. We propose integrating insights from cognitive load theory into the design and evaluation of LLMs to better anticipate and mitigate the risks of adversarial attacks. By expanding our experiments to encompass a broader range of models and by highlighting vulnerabilities in LLMs' ICL, we aim to ensure the development of safer and more reliable AI systems.

Paper Structure

This paper contains 42 sections, 31 figures, 6 tables, 1 algorithm.

Figures (31)

  • Figure 1: (A) The attack success rate increases as cognitive load rises (from left to right) on the Forbidden Question Dataset. This increase is based on the implementation of the automated attack algorithm. Here, the total number of successful attacks at a given cognitive load is the cumulative sum of successful attacks up to that load. (B) The model's performance in writing code to draw an animal decreases as cognitive load transitions from low to overload.
  • Figure 2: Comparison of owl images drawn using Python turtle code generated by LLMs, with incremental cognitive loads from left to right. Note: For a few images where the background color was not white or the body color was white, we modified the colors in the code so that the images display distinctly.
  • Figure 3: Comparison of unicorn images drawn using Python turtle code generated by LLMs, with incremental cognitive loads from top to bottom. Note: For a few images where the background color was not white or the body color was white, we modified the colors in the code so that the images display distinctly.
  • Figure 4: Images of unicorns rendered from the TikZ code generated by LLMs, with incremental cognitive loads from top to bottom. Note: For a few images where the background color was not white or the body color was white, we modified the colors in the code so that the images display distinctly.
  • Figure 5: The observation task asking 'How to create cake?' is hidden using obfuscation tags [INST] and [/INST].
  • ...and 26 more figures
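
The caption of Figure 1(A) describes the attack success rate at each cognitive load as a cumulative sum of successes up to that load. A minimal sketch of that bookkeeping follows, assuming each question is counted once at the first load level where the attack succeeds. The function name `cumulative_success_rate` and the counts in the usage lines are hypothetical illustrations, not the authors' code or data.

```python
from itertools import accumulate


def cumulative_success_rate(new_successes, total_questions):
    """Return the cumulative attack success rate at each cognitive-load level.

    new_successes[i] is the number of questions first jailbroken at load
    level i (an assumed bookkeeping convention, not taken from the paper).
    """
    if total_questions <= 0:
        raise ValueError("total_questions must be positive")
    # Running total of successes up to and including each load level.
    running = list(accumulate(new_successes))
    if running and running[-1] > total_questions:
        raise ValueError("more successes than questions")
    return [count / total_questions for count in running]


# Hypothetical counts for four load levels over 40 questions.
rates = cumulative_success_rate([10, 5, 3, 2], 40)
print(rates)
```

Because the counts are accumulated, the resulting curve can only stay flat or rise from one load level to the next, which matches the shape the caption describes.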