Table of Contents
Fetching ...

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong

TL;DR

This paper addresses the lack of rigorous formalization and benchmarking for prompt injection attacks on LLM-Integrated Applications. It introduces a formal attack framework that unifies existing attacks and enables new, combined strategies, and it benchmarks 5 attacks across 10 LLMs and 7 tasks. It also catalogs defenses into prevention and detection categories, evaluating their effectiveness and highlighting their limitations. An open-source platform is provided to support ongoing research and reproducibility in evaluating future attacks and defenses.

Abstract

A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.

Formalizing and Benchmarking Prompt Injection Attacks and Defenses

TL;DR

This paper addresses the lack of rigorous formalization and benchmarking for prompt injection attacks on LLM-Integrated Applications. It introduces a formal attack framework that unifies existing attacks and enables new, combined strategies, and it benchmarks 5 attacks across 10 LLMs and 7 tasks. It also catalogs defenses into prevention and detection categories, evaluating their effectiveness and highlighting their limitations. An open-source platform is provided to support ongoing research and reproducibility in evaluating future attacks and defenses.

Abstract

A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.
Paper Structure (18 sections, 6 equations, 8 figures, 33 tables)

This paper contains 18 sections, 6 equations, 8 figures, 33 tables.

Figures (8)

  • Figure 1: Illustration of LLM-integrated Application under attack. An attacker injects instruction/data into the data to make an LLM-integrated Application produce attacker-desired responses for a user.
  • Figure 2: ASV of different attacks for different target and injected tasks. Each figure corresponds to an injected task and the x-axis DSD, GC, HD, NLI, SA, SD, and Summ represent the 7 target tasks. The LLM is GPT-4.
  • Figure 3: ASV and MR of Combined Attack for each LLM averaged over the $7\times 7$ target/injected task combinations.
  • Figure 4: Impact of the number of in-context learning examples on Combined Attack for different target and injected tasks. Each figure corresponds to an injected task and the curves correspond to target tasks. The LLM is GPT-4.
  • Figure 5: Examples of different delimiters, instructional prevention, and sandwich prevention.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 1: Prompt Injection Attack