InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

Hao Li; Xiaogeng Liu

TL;DR

Prompt injection poses a serious risk to LLMs, and existing prompt guard models frequently exhibit over-defense: they flag benign inputs as malicious because their decisions hinge on trigger words. The authors introduce NotInject, an evaluation dataset that quantifies this over-defense, and propose InjecGuard, a lightweight yet strong guard model trained with Mitigating Over-defense for Free (MOF), a strategy that reduces trigger-word bias without relying on dataset-specific cues. Evaluations on NotInject show that existing guards struggle with over-defense, while InjecGuard achieves an average accuracy of 83.48% and an over-defense accuracy of 87.32%, closely rivaling GPT-4o at far lower computational cost. The datasets, training data, and code are fully open source, enabling transparent benchmarking and safer deployment of prompt guard systems.

Abstract

Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense -- falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/leolee99/InjecGuard.
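To make the over-defense measurement concrete, the following is a minimal sketch of how accuracy on a benign-only set could be computed. It is not the authors' evaluation code: the trigger-word list, the toy keyword guard, and the sample prompts are all assumptions invented for illustration, and a real evaluation would substitute an actual guard model and the NotInject samples.

```python
from typing import Callable, Iterable

# Hypothetical trigger words; the actual NotInject word list is not given here.
TRIGGER_WORDS = {"ignore", "override", "bypass"}


def keyword_guard(prompt: str) -> bool:
    """Toy guard (assumption): flag a prompt if it contains any trigger word."""
    tokens = {tok.strip(".,!?;:\"'").lower() for tok in prompt.split()}
    return bool(tokens & TRIGGER_WORDS)


def over_defense_accuracy(
    guard: Callable[[str], bool], benign_prompts: Iterable[str]
) -> float:
    """Fraction of benign prompts that the guard correctly leaves unflagged."""
    prompts = list(benign_prompts)
    if not prompts:
        raise ValueError("need at least one benign prompt")
    correct = sum(1 for p in prompts if not guard(p))
    return correct / len(prompts)


# Illustrative benign prompts (made up for this sketch, not taken from NotInject).
benign = [
    "How do I ignore whitespace when comparing two strings?",
    "Can I override a method in a Python subclass?",
    "What is the capital of France?",
    "Is there a bypass road around the city centre?",
]

score = over_defense_accuracy(keyword_guard, benign)
print(f"over-defense accuracy: {score:.2f}")
```

A guard that keys on trigger words alone flags every benign prompt containing one, which is the failure mode a benign-only benchmark is designed to expose.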
