Table of Contents
Fetching ...

How Not to Detect Prompt Injections with an LLM

Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha

TL;DR

The paper analyzes prompt injection defenses, focusing on Known-Answer Detection (KAD) and its Strong variant DataSentinel. It reveals a structural vulnerability: KAD relies on the detection LLM following injected instructions, which adaptive attacks can exploit to leak the secret key and cause the backend LLM to execute injected tasks. The authors introduce DataFlip, a handcrafted IF/ELSE prompt that triggers secret-key leakage while steering the backend toward the injected goal, achieving near-perfect evasion across models and tasks. Experiments show DataFlip reduces detection rates to near zero and yields high attack success, even against Fine-tuned detectors, highlighting a fundamental weakness in output-only defenses. The work argues for defenses that inspect internal prompt behavior or reasoning rather than solely relying on observable outputs.

Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where adversaries embed malicious instructions within seemingly benign input data to manipulate the LLM's intended behavior. Recent defenses based on known-answer detection (KAD) scheme have reported near-perfect performance by observing an LLM's output to classify input data as clean or contaminated. KAD attempts to repurpose the very susceptibility to prompt injection as a defensive mechanism. We formally characterize the KAD scheme and uncover a structural vulnerability that invalidates its core security premise. To exploit this fundamental vulnerability, we methodically design an adaptive attack, DataFlip. It consistently evades KAD defenses, achieving detection rates as low as $0\%$ while reliably inducing malicious behavior with a success rate of $91\%$, all without requiring white-box access to the LLM or any optimization procedures.

How Not to Detect Prompt Injections with an LLM

TL;DR

The paper analyzes prompt injection defenses, focusing on Known-Answer Detection (KAD) and its Strong variant DataSentinel. It reveals a structural vulnerability: KAD relies on the detection LLM following injected instructions, which adaptive attacks can exploit to leak the secret key and cause the backend LLM to execute injected tasks. The authors introduce DataFlip, a handcrafted IF/ELSE prompt that triggers secret-key leakage while steering the backend toward the injected goal, achieving near-perfect evasion across models and tasks. Experiments show DataFlip reduces detection rates to near zero and yields high attack success, even against Fine-tuned detectors, highlighting a fundamental weakness in output-only defenses. The work argues for defenses that inspect internal prompt behavior or reasoning rather than solely relying on observable outputs.

Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where adversaries embed malicious instructions within seemingly benign input data to manipulate the LLM's intended behavior. Recent defenses based on known-answer detection (KAD) scheme have reported near-perfect performance by observing an LLM's output to classify input data as clean or contaminated. KAD attempts to repurpose the very susceptibility to prompt injection as a defensive mechanism. We formally characterize the KAD scheme and uncover a structural vulnerability that invalidates its core security premise. To exploit this fundamental vulnerability, we methodically design an adaptive attack, DataFlip. It consistently evades KAD defenses, achieving detection rates as low as while reliably inducing malicious behavior with a success rate of , all without requiring white-box access to the LLM or any optimization procedures.

Paper Structure

This paper contains 25 sections, 15 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of KAD. Part (1) illustrates KAD under benign input, where the detection LLM follows the detection instruction and returns the secret key—correctly classifying the input as Clean. Part (2) shows KAD under a basic attack, where the detection LLM follows the injected instruction and returns an adversarial output—correctly classifying the input as Contaminated. Part (3) presents KAD under our adaptive attack (DataFlip), where the detection LLM follows the IF clause of the injected instruction to return the secret key—causing KAD to misclassify the input as Clean and allowing it to bypass detection.