Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models

Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Ailin Tao

TL;DR

This work analyzes intrinsic vulnerabilities in large language models by introducing Priming Attacks that exploit cognitive-like phenomena to bypass safety mechanisms. It demonstrates near-perfect attack success on open-source models and high success on closed-source ones, using AdvBench and MaliciousInstruct data. Through attention- and neuron-level analyses, the study identifies structural factors that enable priming and provides ablations to quantify the contributions of prompts, temperature, and decoding. The findings highlight critical gaps in current alignment approaches and underscore the need for robust defenses and safer deployment of LLMs in high-stakes settings.

Abstract

Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw: their susceptibility to generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the "Priming Effect", "Safe Attention Shift", and "Cognitive Dissonance", effectively bypass the models' guarding mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta's Llama-3.2, Google's Gemma-2, Mistral's Mistral-NeMo, Falcon's Falcon-mamba, Apple's DCLM, Microsoft's Phi3, and Qwen's Qwen2.5, among others. Similarly, for closed-source models such as OpenAI's GPT-4o, Google's Gemini-1.5, and Claude-3.5, we observed an ASR of at least 95% on the AdvBench dataset, a result that represents the current state of the art. This study underscores the urgent need to reassess the use of generative models in critical applications to mitigate potential adverse societal impacts.
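
For readers unfamiliar with the metric, ASR is commonly computed as the percentage of attack attempts that a judge marks as successful. The sketch below illustrates only that arithmetic. The function name `attack_success_rate` and the example outcomes are hypothetical, and the abstract does not say how success is judged, so the boolean labels are assumed to come from whatever evaluation procedure the paper uses.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return the percentage of attack attempts judged successful.

    Each element is True if the attempt was judged successful and
    False otherwise. How that judgment is made is an assumption
    left outside this sketch.
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical outcomes for 20 prompts: 19 judged successful, 1 refused.
example = [True] * 19 + [False]
print(f"ASR = {attack_success_rate(example):.1f}%")  # prints "ASR = 95.0%"
```

Under this definition, an ASR of 100% means every evaluated prompt was judged a successful attack, and "at least 95%" means at most one failure per 20 prompts.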

Paper Structure

This paper contains 38 sections, 4 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Framework illustration of proposed priming attack methods.
  • Figure 2: The impact of different arrangements of the same content on ASR.
  • Figure 3: An example of hierarchical attention maps shows how attention is distributed across tokens. The x-axis and y-axis label the tokens, and by adjusting attention thresholds, we observe where attention is concentrated. In areas where attention is high (>0.9), the focus is primarily on the last token, suggesting a crucial role for the last token in influencing the next token's generation. As we include regions of lower attention (>0.3), a more intricate network of attention between tokens begins to emerge.
  • Figure 4: The contribution of safe priming keywords to the Adversarial Success Rate.
  • Figure 5: The contribution of different priming components to the Adversarial Success Rate.
  • ...and 8 more figures