How Not to Detect Prompt Injections with an LLM
Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha
TL;DR
The paper analyzes prompt injection defenses, focusing on Known-Answer Detection (KAD) and its Strong variant DataSentinel. It reveals a structural vulnerability: KAD relies on the detection LLM following injected instructions, which adaptive attacks can exploit to leak the secret key and cause the backend LLM to execute injected tasks. The authors introduce DataFlip, a handcrafted IF/ELSE prompt that triggers secret-key leakage while steering the backend toward the injected goal, achieving near-perfect evasion across models and tasks. Experiments show DataFlip reduces detection rates to near zero and yields high attack success, even against Fine-tuned detectors, highlighting a fundamental weakness in output-only defenses. The work argues for defenses that inspect internal prompt behavior or reasoning rather than solely relying on observable outputs.
Abstract
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where adversaries embed malicious instructions within seemingly benign input data to manipulate the LLM's intended behavior. Recent defenses based on known-answer detection (KAD) scheme have reported near-perfect performance by observing an LLM's output to classify input data as clean or contaminated. KAD attempts to repurpose the very susceptibility to prompt injection as a defensive mechanism. We formally characterize the KAD scheme and uncover a structural vulnerability that invalidates its core security premise. To exploit this fundamental vulnerability, we methodically design an adaptive attack, DataFlip. It consistently evades KAD defenses, achieving detection rates as low as $0\%$ while reliably inducing malicious behavior with a success rate of $91\%$, all without requiring white-box access to the LLM or any optimization procedures.
