When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Max Fomin

TL;DR

This work proposes Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization and shows that LODO-stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts.

Abstract

Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy, exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compositions rather than semantic content. We systematically compare production guardrails (PromptGuard 2, LlamaGuard) and LLM-as-judge approaches on our benchmark, finding that all three fail on indirect attacks targeting agents (7-37% detection) and that PromptGuard 2 and LlamaGuard cannot evaluate agentic tool injection due to architectural limitations. Finally, we show that LODO-stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts. We release our evaluation framework at https://github.com/maxf-zn/prompt-mining to establish LODO as the appropriate protocol for prompt attack detection research.
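The LODO protocol described in the abstract can be illustrated with a minimal sketch: each dataset is held out in turn, a classifier is fit on the remaining datasets, and accuracy is reported on the held-out one. Everything below is an assumption made for illustration and is not taken from the released framework: the function names (`lodo_splits`, `fit_nearest_mean`, `lodo_accuracy`), the one-dimensional toy features, and the nearest-class-mean classifier that stands in for the activation-based probes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def lodo_splits(dataset_ids: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_name, train_indices, test_indices), one fold per dataset."""
    by_dataset: Dict[str, List[int]] = defaultdict(list)
    for i, name in enumerate(dataset_ids):
        by_dataset[name].append(i)
    for held_out in sorted(by_dataset):
        test_idx = by_dataset[held_out]
        train_idx = [i for i, name in enumerate(dataset_ids) if name != held_out]
        yield held_out, train_idx, test_idx


def fit_nearest_mean(xs: Sequence[float], ys: Sequence[int]) -> Callable[[float], float]:
    """Hypothetical stand-in classifier: a positive score means 'closer to the malicious mean'."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: abs(x - mu_neg) - abs(x - mu_pos)


def lodo_accuracy(fit, xs, ys, dataset_ids) -> Dict[str, float]:
    """Train on all datasets but one and report accuracy on the held-out dataset."""
    results: Dict[str, float] = {}
    for held_out, train_idx, test_idx in lodo_splits(dataset_ids):
        score = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum((score(xs[i]) > 0) == (ys[i] == 1) for i in test_idx)
        results[held_out] = correct / len(test_idx)
    return results


# Toy data (assumed): datasets "a" and "b" look alike, "c" is shifted.
XS = [0.0, 0.2, 1.0, 1.2, 0.1, 0.3, 0.9, 1.1, 0.7, 0.9, 1.7, 1.9]
YS = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
IDS = ["a"] * 4 + ["b"] * 4 + ["c"] * 4

if __name__ == "__main__":
    for name, acc in lodo_accuracy(fit_nearest_mean, XS, YS, IDS).items():
        print(f"held-out {name}: accuracy {acc:.2f}")
```

On this toy data the shifted dataset scores worst when held out, which is the kind of per-dataset gap a pooled random split would hide. With real activations, a grouped splitter from an ML library would play the role of `lodo_splits`.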

Paper Structure (92 sections, 12 equations, 6 figures, 15 tables)


Introduction
Related Work
Prompt Injection and Jailbreak Attacks.
Activation-Based Detection.
Production Guardrails.
Adversarial Robustness vs. Distribution Generalization.
SAE Features for Classification.
Dataset Shortcuts.
Methods
Problem Setup
Threat Model.
Dataset Composition.
Activation-Based Classification
Raw Activations.
SAE Features.
...and 77 more sections

Figures (6)

Figure 1: Method overview. We compile 18 datasets (105K samples) spanning jailbreaks, indirect injection, harmful requests, and benign prompts. We extract activations from Llama-3.1-8B-Instruct at layer 31 (raw) and through an SAE encoder (sparse features). Leave-One-Dataset-Out (LODO) evaluation trains on $N{-}1$ datasets and tests on the held-out dataset, revealing that standard CV overestimates performance by 8.4 percentage points (0.996 vs 0.912 AUC).
Figure 2: Example prompts from our benchmark illustrating the diversity of benign and malicious samples. Benign prompts (left) include general knowledge questions, business requests, and technical support queries. Malicious prompts (right) range from direct harmful requests (advbench) and jailbreak attempts (wildjailbreak) to indirect prompt injections embedded in tool calls (InjecAgent).
Figure 3: t-SNE visualization of activations colored by dataset. Datasets form distinct clusters, enabling a trivial dataset classifier (96% CV accuracy) and explaining why classifiers learn dataset-specific shortcuts rather than generalizable attack patterns.
Figure 4: Sensitivity analysis for LODO coefficient retention. (A) Shortcut prevalence heatmap by $K$ and retention threshold (firing ratio=1.5$\times$). (B) Prevalence curves across $K$ for each retention threshold. (C) Alternative stability metrics: sign agreement remains high ($>$99%) and Spearman correlation averages 0.89 across folds. (D) Retention distribution for top-50 vs top-200 features.
Figure 5: Threshold calibration under LODO. (A) Aggregate F1 vs threshold; the pooled optimum is $t^*=0.01$ (F1=0.848) vs $t=0.5$ (F1=0.793). (B) Per-dataset optimal thresholds range from 0.01 (BIPIA, deepset) to 0.73 (jayavibhav). (C) F1 loss from using $t=0.5$ varies by dataset: BIPIA loses 17pp, deepset 8pp, while jayavibhav and safeguard lose $<$1pp.
...and 1 more figure
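The threshold calibration summarized in the Figure 5 caption amounts to sweeping a decision threshold over classifier scores and keeping the value that maximizes F1. The sketch below shows one way to do that. The function names, the `score >= t` decision rule, the candidate grid, and the toy scores are all assumptions made for illustration and are not taken from the paper's code.

```python
from typing import Sequence


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float], t: float) -> float:
    """F1 of the positive (malicious) class when predicting positive iff score >= t (assumed rule)."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(labels: Sequence[int], scores: Sequence[float], grid: Sequence[float]) -> float:
    """Return the grid value with the highest F1 (the first one wins on ties)."""
    return max(grid, key=lambda t: f1_at_threshold(labels, scores, t))


# Toy held-out scores (assumed): one malicious sample receives a very low score,
# so the default threshold of 0.5 misses it.
LABELS = [0, 0, 1, 1, 1]
SCORES = [0.02, 0.30, 0.05, 0.40, 0.90]
GRID = [0.01, 0.1, 0.5]

if __name__ == "__main__":
    t_star = best_threshold(LABELS, SCORES, GRID)
    print("best threshold:", t_star)
    print("F1 at best:", f1_at_threshold(LABELS, SCORES, t_star))
    print("F1 at 0.5:", f1_at_threshold(LABELS, SCORES, 0.5))
```

Running the same sweep per held-out dataset instead of on pooled scores would give per-dataset optima of the kind shown in panel (B).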
