The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities

Ron Litvak

Abstract

System prompt configuration can make the difference between near-total phishing blindness and near-perfect detection in LLM email agents. We present PhishNChips, a study of 11 models under 10 prompt strategies, showing that prompt-model interaction is a first-order security variable: a single model's phishing bypass rate ranges from under 1% to 97% depending on how it is configured, while the false-positive cost of the same prompt varies sharply across models. We then show that optimizing prompts around highly predictive signals can improve benchmark performance, reaching up to 93.7% recall at a 3.8% false-positive rate, but also creates a brittle attack surface. In particular, domain-matching strategies perform well when legitimate emails mostly have matched sender and URL domains, yet degrade sharply when attackers invert that signal by registering matching infrastructure. Response-trace analysis shows that in 98% of successful bypasses the model's reasoning is consistent with the inverted signal: the model is following the instruction, but the instruction's core assumption has become false. A counter-intuitive corollary follows: making prompts more specific can degrade already-capable models by replacing broader multi-signal reasoning with exploitable single-signal dependence. We characterize the resulting tension between detection, usability, and adversarial robustness as a navigable tradeoff, introduce Safetility, a deployability-aware metric that penalizes false positives, and argue that closing the adversarial gap likely requires tool augmentation with external ground truth.
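
To make the headline metrics concrete, the sketch below shows how a recall/FPR operating point might be scored with a deployability-aware penalty on false positives. The scoring function is a hypothetical stand-in: the paper's actual Safetility definition appears later and may combine the terms differently.

```python
# Illustrative sketch only. Safetility is defined precisely later in the paper;
# the combination used here (recall discounted by false-positive rate) is an
# assumed placeholder, not the published formula.

def recall(tp: int, fn: int) -> float:
    """Fraction of phishing emails correctly flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of legitimate emails incorrectly flagged."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def safetility_sketch(tp: int, fn: int, fp: int, tn: int,
                      fpr_weight: float = 1.0) -> float:
    """Hypothetical deployability-aware score: reward recall, penalize FPR."""
    return recall(tp, fn) - fpr_weight * false_positive_rate(fp, tn)

# Example: the 93.7% recall / 3.8% FPR operating point from the abstract scores
# higher than a trigger-happy prompt with the same recall but a 20% FPR.
print(safetility_sketch(tp=937, fn=63, fp=38, tn=962))   # ~0.899
print(safetility_sketch(tp=937, fn=63, fp=200, tn=800))  # ~0.737
```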

Paper Structure

This paper contains 73 sections, 1 equation, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: GPT-4o-mini: configuration determines security posture. Blue circles show commodity phishing recall; red diamonds show FPR; the orange dot shows recall on infrastructure phishing under the same optimized prompt. The optimized configuration drops FPR by 79 pp while maintaining 93.7% recall, but infrastructure phishing collapses recall to 30.1% (−64 pp): the same strategy that rescued the model also made it maximally exploitable.
  • Figure 2: Best operating point per model, with bubble size proportional to Safetility. Models cluster in the high-recall, low-FPR region under optimized strategies. Grok 4.1 (90.7%), GPT-5.2 (87.2%), and GPT-4o-mini (87.1%) achieve the highest Safetility scores.
  • Figure 3: Which strategies collapse? Commodity phishing recall (blue) vs. infrastructure phishing recall (red) for five strategies. Signal-based strategies (sender_url_match, trap_sender_match) collapse by 41–46 pp, while baseline and security_first strategies are unaffected or improve. This figure shows the strategy-level view; Figure 4 shows the per-model breakdown.
  • Figure 4: Which models collapse? Per-model vulnerability to infrastructure phishing under sender_url_match. Models sorted by collapse severity. Green: immune (Δ ≤ 3 pp); yellow: resistant (3–20 pp); red: collapsed (>20 pp); the threshold logic is sketched in code after this list. While Figure 3 shows that signal-based strategies collapse on average, this figure reveals that vulnerability varies dramatically across models, from Qwen 3 (effectively immune) to GPT-4o-mini and Gemini 3 Flash (−69 pp and −68 pp).
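
Figure 4's three-way categorization follows directly from the recall drop between the commodity and infrastructure evaluations. A minimal sketch of that bucketing, using the thresholds stated in the caption, is below; the function name and the worked example are illustrative (the 93.7% to 30.1% pair is the GPT-4o-mini operating point quoted in Figure 1, not a Figure 4 measurement).

```python
# Sketch of the immune / resistant / collapsed bucketing described in Figure 4.
# Thresholds come from the caption; everything else here is illustrative.

def collapse_bucket(commodity_recall_pct: float, infra_recall_pct: float) -> str:
    """Classify a model by how many percentage points of recall it loses
    when commodity phishing is replaced by infrastructure phishing."""
    delta_pp = commodity_recall_pct - infra_recall_pct
    if delta_pp <= 3:
        return "immune"
    if delta_pp <= 20:
        return "resistant"
    return "collapsed"

# Worked example using the Figure 1 operating point: 93.7% -> 30.1% recall,
# a drop of roughly 64 pp, which lands in the "collapsed" bucket.
print(collapse_bucket(93.7, 30.1))  # collapsed
```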