PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs
Jiahao Yu, Yangguang Shao, Hanwen Miao, Junzheng Shi
TL;DR
PROMPTFUZZ introduces a black-box, two-stage fuzzing framework to systematically assess and improve LLM robustness against prompt injection. By combining seed-based mutations, few-shot guided mutations, and an adaptive focus stage with early termination, it uncovers vulnerabilities even against strong defenses. Empirical evaluation on TensorTrust shows competitive best attack rates and near-complete coverage for hijacking, with real-world demonstrations including a top competitive ranking and effects on popular applications; a fine-tuned model reduces vulnerability but does not eliminate it. The work emphasizes the need for automated, scalable testing tools and provides open-source resources to foster ongoing robustness against prompt-injection attacks in LLM deployments.
Abstract
Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. However, prompt injection attacks, which involve overwriting a model's original instructions with malicious prompts to manipulate the generated text, have raised significant concerns about the security and reliability of LLMs. Ensuring that LLMs are robust against such attacks is crucial for their deployment in real-world applications, particularly in critical tasks. In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM's resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts. By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.
