Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
Nilanjana Das, Edward Raff, Aman Chadha, Manas Gaur
TL;DR
The paper addresses the vulnerability of Large Language Models to human-readable, situation-driven adversarial prompts that blend benign content with harmful intent. It introduces the Human-Readable Situation-Driven Adversarial Attack (HSA), which combines a Malicious Prompt ($MP$), a human-readable Adversarial Insertion ($AdvIns$), and situational movie context ($Sit$) into a full-prompt, followed by a paraphrase step that produces a refined payload. The contributions are a reproducible prompt-engineering workflow built on movie contexts, a method for transforming gibberish suffixes into readable yet effective insertions, and an AdvPrompter-based pipeline enhanced with $p$-nucleus sampling to scale attacks across both open-source and proprietary LLMs. The results demonstrate notable safety vulnerabilities in models such as Gemma-7b and GPT-3.5-Turbo-0125, highlight the cross-domain transferability of insertions, and argue for strengthened safety mechanisms and adversarial training to counter these human-readable attacks. Overall, the work provides a framework for evaluating and mitigating the real-world risks of deploying LLMs on social platforms.
Abstract
As AI systems become deeply embedded in social media platforms, we have uncovered a concerning security vulnerability that goes beyond traditional adversarial attacks. It is important to assess the risks of LLMs before the general public uses them on social media platforms, so that adverse impacts can be avoided. Unlike obviously nonsensical text strings, which safety systems can easily catch, human-readable situation-driven adversarial full-prompts that leverage situational context are effective yet much harder to detect. We found that skilled attackers can exploit vulnerabilities in open-source and proprietary LLMs to make a malicious user query appear safe to the model, causing it to generate a harmful response. This raises an important question about how robust LLMs are to such attacks. To measure robustness against human-readable attacks, which now present a potent threat, our research makes three major contributions. First, we developed attacks that use movie scripts as situational contextual frameworks, creating natural-looking full-prompts that trick LLMs into generating harmful content. Second, we developed a method to transform gibberish adversarial text into readable, innocuous content that still exploits vulnerabilities when used within the full-prompts. Finally, we enhanced the AdvPrompter framework with p-nucleus sampling to generate diverse human-readable adversarial texts that significantly improve attack effectiveness against models such as GPT-3.5-Turbo-0125 and Gemma-7b. Our findings show that these systems can be manipulated to operate beyond their intended ethical boundaries when presented with seemingly normal prompts that contain hidden adversarial elements. By identifying these vulnerabilities, we aim to drive the development of more robust safety mechanisms that can withstand sophisticated attacks in real-world applications.
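For readers unfamiliar with the decoding step named above, p-nucleus (top-p) sampling restricts each draw to the smallest set of highest-probability tokens whose cumulative probability reaches p, then samples from that set after renormalization. The sketch below is a generic illustration of that technique on a toy distribution. It is not the authors' AdvPrompter integration, which this section does not describe, and the function names `nucleus_filter` and `sample_top_p` are hypothetical.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of highest-probability indices whose
    cumulative probability reaches p, and renormalize their weights."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_top_p(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_probs = [0.5, 0.3, 0.15, 0.05]  # illustrative next-token distribution
    print(nucleus_filter(toy_probs, 0.75))
    print(sample_top_p(toy_probs, 0.75, random.Random(0)))
```

A smaller p concentrates sampling on the most likely tokens, while a larger p admits more of the tail and so yields more varied outputs, which is the property the abstract associates with generating diverse texts.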
