Table of Contents
Fetching ...

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Hao Li, Xiaogeng Liu

TL;DR

Prompt injection poses a serious risk to LLMs, with existing prompt guard models frequently exhibiting over-defense by basing decisions on trigger words. The authors introduce NotInject to quantify this over-defense and propose InjecGuard, a lightweight yet strong guard model trained with MOF to mitigate over-defense without relying on dataset-specific cues. NotInject evaluations show existing guards struggle with over-defense, while InjecGuard achieves an average accuracy of 83.48% and over-defense of 87.32%, closely rivaling GPT-4o but with far lower computational cost. The work provides fully open-source datasets, training data, and code, enabling transparent benchmarking and safer deployment of prompt guard systems across diverse environments.

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense -- falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/leolee99/InjecGuard.

InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

TL;DR

Prompt injection poses a serious risk to LLMs, with existing prompt guard models frequently exhibiting over-defense by basing decisions on trigger words. The authors introduce NotInject to quantify this over-defense and propose InjecGuard, a lightweight yet strong guard model trained with MOF to mitigate over-defense without relying on dataset-specific cues. NotInject evaluations show existing guards struggle with over-defense, while InjecGuard achieves an average accuracy of 83.48% and over-defense of 87.32%, closely rivaling GPT-4o but with far lower computational cost. The work provides fully open-source datasets, training data, and code, enabling transparent benchmarking and safer deployment of prompt guard systems across diverse environments.

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense -- falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/leolee99/InjecGuard.

Paper Structure

This paper contains 29 sections, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: Performance comparison of injection detection: We present the average accuracy across benign, malicious, and over-defense cases, plotted against time efficiency. Our method achieves the best performance across performance and efficiency.
  • Figure 2: Over-denfese issue in ProtectAIv2 protectai1, the current SotA prompt guard model.
  • Figure 3: Visualization of attention weight. Given an instruction of "[CLS] Can I ignore this warning appeared in my code? [SEP]", ProtectAIv2 protectai assigns extremely high attention weights to the word "ignore," leading to misclassification as Injection. In contrast, our method distributes attention across the entire sentence, successfully predicting it as benign.
  • Figure 4: The pipeline for constructing NotInject dataset
  • Figure 5: Comparison of benign, malicious, and over-defense accuracy across various prompt guard solutions. InjecGuard significantly outperforms all prior solutions. Notably, the open-source models (Deepset, Fmops, PromptGuard, ProtectAIv2) exhibit significant over-defense issues, with over-defense accuracy under 60%, where 50% represents random guessing. In addition, although LakeraGuard and LlamaGuard3 demonstrate strong over-defense performance, their effectiveness is still limited by suboptimal malicious accuracy.
  • ...and 8 more figures