Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

Nilanjana Das, Edward Raff, Aman Chadha, Manas Gaur

TL;DR

The paper addresses the vulnerability of Large Language Models (LLMs) to human-readable, situation-driven adversarial prompts that blend benign content with harmful intent. It introduces the Human-Readable Situation-Driven Adversarial Attack (HSA), which combines a Malicious Prompt ($MP$), a human-readable Adversarial Insertion ($AdvIns$), and situational movie context ($Sit$) into a full-prompt, followed by a paraphrase step that produces a refined payload. The contributions include a reproducible prompt-engineering workflow using movie contexts, a method to transform gibberish suffixes into readable yet effective insertions, and an AdvPrompter-based pipeline enhanced with $p$-nucleus sampling to scale attacks across both open-source and proprietary LLMs. The results demonstrate notable safety vulnerabilities in models like Gemma-7b and GPT-3.5-Turbo-0125, highlight the cross-domain transferability of insertions, and argue for strengthened safety mechanisms and adversarial training to counter these human-readable attacks. Overall, the work provides a framework for evaluating and mitigating real-world risks of LLM deployment on social platforms.

Abstract

As AI systems become deeply embedded in social media platforms, we have uncovered a concerning security vulnerability that goes beyond traditional adversarial attacks. It is important to assess the risks of LLMs before the general public uses them on social media platforms, so that adverse impacts can be avoided. Unlike obviously nonsensical text strings that safety systems can easily catch, human-readable, situation-driven adversarial full-prompts that leverage situational context are effective yet much harder to detect. We found that skilled attackers can exploit vulnerabilities in open-source and proprietary LLMs to make a malicious user query appear safe to the model, which then generates a harmful response. This raises an important question about the vulnerabilities of LLMs. To measure robustness against human-readable attacks, which now present a potent threat, our research makes three major contributions. First, we developed attacks that use movie scripts as situational contextual frameworks, creating natural-looking full-prompts that trick LLMs into generating harmful content. Second, we developed a method to transform gibberish adversarial text into readable, innocuous content that still exploits vulnerabilities when used within the full-prompts. Finally, we enhanced the AdvPrompter framework with p-nucleus sampling to generate diverse human-readable adversarial texts that significantly improve attack effectiveness against models like GPT-3.5-Turbo-0125 and Gemma-7b. Our findings show that these systems can be manipulated to operate beyond their intended ethical boundaries when presented with seemingly normal prompts that contain hidden adversarial elements. By identifying these vulnerabilities, we aim to drive the development of more robust safety mechanisms that can withstand sophisticated attacks in real-world applications.
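The abstract refers to p-nucleus sampling as the source of diversity in generated text. As a rough illustration of that general decoding technique only, the sketch below samples from the smallest set of tokens whose cumulative probability reaches p. It is not the authors' AdvPrompter integration: the function names `nucleus_indices` and `nucleus_sample`, the toy four-token distribution, and the value p = 0.9 are all assumptions made for this example.

```python
import random


def nucleus_indices(probs, p):
    """Return the smallest set of token indices, taken in order of decreasing
    probability, whose cumulative probability is at least p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def nucleus_sample(probs, p=0.9, rng=None):
    """Draw one token index from the nucleus; weights are the original
    probabilities, which random.choices renormalises implicitly."""
    rng = rng or random.Random()
    kept = nucleus_indices(probs, p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy next-token distribution over four hypothetical tokens.
toy_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
samples = [nucleus_sample(toy_probs, p=0.9, rng=rng) for _ in range(200)]
```

With these toy values the lowest-probability token falls outside the nucleus and is never drawn, while the remaining three tokens can all appear. That is the trade-off nucleus sampling offers: more varied outputs than greedy decoding, without sampling from the unlikely tail.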

Paper Structure

This paper contains 25 sections, 7 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Pipeline for generating situation-adaptive, human-readable adversarial prompts that exploit LLM vulnerabilities. MP+Sit: Concatenation of Malicious Prompt (MP) and Situational context. Adv Ins: Adversarial Insertion from an anonymous attacker.
  • Figure 2: We introduce the Human-Readable Situation-Driven Adversarial Attack, which starts with a nonsensical adversarial suffix. The suffix is converted to a human-readable adversarial insertion and combined with the malicious prompt (the attacker's desire) and a situational context (e.g., a movie script) to form the initial payload. Another LLM paraphrases the payload, and two attack strategies are used to attack multiple LLMs.
  • Figure 3: Optimized nonsensical adversarial suffix as produced by the random search algorithm for the customized prompt template.
  • Figure 4: Paraphrased full-prompt and the response by GPT-4, which received a harmfulness score of 4 from GPT-4 acting as a judge with few-shot chain-of-thought prompting.
  • Figure 5: Heatmap of human-evaluation results, showing the distribution of harmfulness scores assigned by human evaluators across genres. P-nucleus-generated adversarial prompts were consistently rated as more harmful, validating the effectiveness of these prompts beyond automated metrics. White cells represent a count of 0.
  • ...and 8 more figures