Table of Contents
Fetching ...

PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs

Jiahao Yu, Yangguang Shao, Hanwen Miao, Junzheng Shi

TL;DR

PROMPTFUZZ introduces a black-box, two-stage fuzzing framework to systematically assess and improve LLM robustness against prompt injection. By combining seed-based mutations, few-shot guided mutations, and an adaptive focus stage with early termination, it uncovers vulnerabilities even against strong defenses. Empirical evaluation on TensorTrust shows competitive best attack rates and near-complete coverage for hijacking, with real-world demonstrations including a top competitive ranking and effects on popular applications; a fine-tuned model reduces vulnerability but does not eliminate it. The work emphasizes the need for automated, scalable testing tools and provides open-source resources to foster ongoing robustness against prompt-injection attacks in LLM deployments.

Abstract

Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. However, prompt injection attacks, which involve overwriting a model's original instructions with malicious prompts to manipulate the generated text, have raised significant concerns about the security and reliability of LLMs. Ensuring that LLMs are robust against such attacks is crucial for their deployment in real-world applications, particularly in critical tasks. In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM's resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts. By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.

PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs

TL;DR

PROMPTFUZZ introduces a black-box, two-stage fuzzing framework to systematically assess and improve LLM robustness against prompt injection. By combining seed-based mutations, few-shot guided mutations, and an adaptive focus stage with early termination, it uncovers vulnerabilities even against strong defenses. Empirical evaluation on TensorTrust shows competitive best attack rates and near-complete coverage for hijacking, with real-world demonstrations including a top competitive ranking and effects on popular applications; a fine-tuned model reduces vulnerability but does not eliminate it. The work emphasizes the need for automated, scalable testing tools and provides open-source resources to foster ongoing robustness against prompt-injection attacks in LLM deployments.

Abstract

Large Language Models (LLMs) have gained widespread use in various applications due to their powerful capability to generate human-like text. However, prompt injection attacks, which involve overwriting a model's original instructions with malicious prompts to manipulate the generated text, have raised significant concerns about the security and reliability of LLMs. Ensuring that LLMs are robust against such attacks is crucial for their deployment in real-world applications, particularly in critical tasks. In this paper, we propose PROMPTFUZZ, a novel testing framework that leverages fuzzing techniques to systematically assess the robustness of LLMs against prompt injection attacks. Inspired by software fuzzing, PROMPTFUZZ selects promising seed prompts and generates a diverse set of prompt injections to evaluate the target LLM's resilience. PROMPTFUZZ operates in two stages: the prepare phase, which involves selecting promising initial seeds and collecting few-shot examples, and the focus phase, which uses the collected examples to generate diverse, high-quality prompt injections. Using PROMPTFUZZ, we can uncover more vulnerabilities in LLMs, even those with strong defense prompts. By deploying the generated attack prompts from PROMPTFUZZ in a real-world competition, we achieved the 7th ranking out of over 4000 participants (top 0.14%) within 2 hours. Additionally, we construct a dataset to fine-tune LLMs for enhanced robustness against prompt injection attacks. While the fine-tuned model shows improved robustness, PROMPTFUZZ continues to identify vulnerabilities, highlighting the importance of robust testing for LLMs. Our work emphasizes the critical need for effective testing tools and provides a practical framework for evaluating and improving the robustness of LLMs against prompt injection attacks.
Paper Structure (38 sections, 1 equation, 12 figures, 8 tables, 2 algorithms)

This paper contains 38 sections, 1 equation, 12 figures, 8 tables, 2 algorithms.

Figures (12)

  • Figure 1: Examples of prompt injection attacks. By injecting malicious prompts, the attacker can manipulate the output of the LLM and achieve different unintended results such as system prompt leakage and remote code execution. The system prompt is the original prompt provided by the developer, while the attacker prompt is the injected prompt by the attacker. The output is the generated text by the LLM based on the system prompt and attacker prompt.
  • Figure 2: Overview of the PromptFuzz framework for prompt injection attacks on LLMs. The framework operates in two stages: the preparation stage and the focus stage. In the preparation stage, ① all human-written seed prompts are collected and uniformly mutated using various mutators. ② The mutated prompts are executed on the target LLM with defense mechanisms to observe the injection results. ③ The effectiveness of each initial seed's mutants and mutator performance are analyzed, preserving top-ranked seeds and high-quality mutants for the next stage. In the focus stage, ④ the fuzzer selects a promising seed from the seed pool based on the selection strategy. ⑤ The mutation process is guided by the preserved high-quality mutants and mutator weights to generate more effective prompts. ⑥ The mutated prompts are executed on the target LLM, and the results update the seed pool with high-quality mutants for future iterations. The process continues until the stopping criterion is met.
  • Figure 3: Examples from the TensorTrust dataset. The figure illustrates the defense mechanisms in the TensorTrust dataset, including the pre-defense and post-defense prompts. The pre-defense prompt sets the context and guides the model's output, while the post-defense prompt constrains the model's output to prevent undesirable responses.
  • Figure 4: Performance change of PromptFuzz and GPTFuzzer-injection as the number of used queries increases. The figure shows the three metrics for both methods as the number of queries increases in two tasks.
  • Figure 5: Sensitivity analysis of PromptFuzz to two hyperparameters. The figure shows the bestASR for different values of the number of few-shot demonstrations $R$ and the early termination coefficient $\epsilon$.
  • ...and 7 more figures