Table of Contents
Fetching ...

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong

TL;DR

DataSentinel proposes a game‑theoretic, minimax framework to detect and defend against prompt injection attacks in LLM‑driven applications by fine‑tuning a detection LLM to anticipate adaptive threats. By alternating inner max attacks and outer min detector updates, the approach yields near‑zero false positives and very low false negatives across diverse tasks and backends, outperforming prior known‑answer detectors and baseline defenses. The results demonstrate practical robustness against existing and adaptive prompt injection attacks when injected instructions are present, with reasonable computational overhead and tunable trade‑offs. This work offers a viable defense‑in‑depth strategy for safeguarding LLM‑integrated systems against evolving prompt‑driven threats.

Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

TL;DR

DataSentinel proposes a game‑theoretic, minimax framework to detect and defend against prompt injection attacks in LLM‑driven applications by fine‑tuning a detection LLM to anticipate adaptive threats. By alternating inner max attacks and outer min detector updates, the approach yields near‑zero false positives and very low false negatives across diverse tasks and backends, outperforming prior known‑answer detectors and baseline defenses. The results demonstrate practical robustness against existing and adaptive prompt injection attacks when injected instructions are present, with reasonable computational overhead and tunable trade‑offs. This work offers a viable defense‑in‑depth strategy for safeguarding LLM‑integrated systems against evolving prompt‑driven threats.

Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.

Paper Structure

This paper contains 31 sections, 6 equations, 6 figures, 16 tables, 3 algorithms.

Figures (6)

  • Figure 1: Illustration of the key difference between known-answer detection and DataSentinel, where the former uses a standard LLM as a detection LLM while the latter fine-tunes the detection LLM via a game-theoretic method.
  • Figure 2: Illustration of fine-tuning the detection LLM $g$. DataSentinel repeats the three steps for multiple rounds.
  • Figure 3: (a) Impact of $r$; (b) Impact of $|D|$.
  • Figure 4: (a) Impact of $\alpha$; (b) Impact of $\beta$.
  • Figure 5: (a) Impact of $n_{in}$; (b) Impact of $n_{out}$.
  • ...and 1 more figures