Table of Contents
Fetching ...

Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks

Safwan Shaheer, G. M. Refatul Islam, Mohammad Rafid Hamid, Tahsin Zaman Jilan

TL;DR

This work tackles prompt injection and goal hijacking in small open-source LLMs (notably LLaMA) by proposing an automated defense-prompt generation framework that seeds defenses with Chain of Thought-inspired prompts and iteratively refines them using a larger model. It combines paraphrasing-based prevention and known-answer detection, evaluating defenses across a benchmark of attacks with metrics like APS, ADS, ASV, FPR, and FNR to demonstrate reduced attack success and false detections while preserving utility. The approach enables scalable defense construction for edge-deployed LLMs and outlines a repeatable workflow for developing robust, open-source security mechanisms. Together, the contributions advance practical prompt-security strategies for resource-constrained, open ecosystems.

Abstract

In this fast-evolving area of LLMs, our paper discusses the significant security risk presented by prompt injection attacks. It focuses on small open-sourced models, specifically the LLaMA family of models. We introduce novel defense mechanisms capable of generating automatic defenses and systematically evaluate said generated defenses against a comprehensive set of benchmarked attacks. Thus, we empirically demonstrated the improvement proposed by our approach in mitigating goal-hijacking vulnerabilities in LLMs. Our work recognizes the increasing relevance of small open-sourced LLMs and their potential for broad deployments on edge devices, aligning with future trends in LLM applications. We contribute to the greater ecosystem of open-source LLMs and their security in the following: (1) assessing present prompt-based defenses against the latest attacks, (2) introducing a new framework using a seed defense (Chain Of Thoughts) to refine the defense prompts iteratively, and (3) showing significant improvements in detecting goal hijacking attacks. Out strategies significantly reduce the success rates of the attacks and false detection rates while at the same time effectively detecting goal-hijacking capabilities, paving the way for more secure and efficient deployments of small and open-source LLMs in resource-constrained environments.

Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks

TL;DR

This work tackles prompt injection and goal hijacking in small open-source LLMs (notably LLaMA) by proposing an automated defense-prompt generation framework that seeds defenses with Chain of Thought-inspired prompts and iteratively refines them using a larger model. It combines paraphrasing-based prevention and known-answer detection, evaluating defenses across a benchmark of attacks with metrics like APS, ADS, ASV, FPR, and FNR to demonstrate reduced attack success and false detections while preserving utility. The approach enables scalable defense construction for edge-deployed LLMs and outlines a repeatable workflow for developing robust, open-source security mechanisms. Together, the contributions advance practical prompt-security strategies for resource-constrained, open ecosystems.

Abstract

In this fast-evolving area of LLMs, our paper discusses the significant security risk presented by prompt injection attacks. It focuses on small open-sourced models, specifically the LLaMA family of models. We introduce novel defense mechanisms capable of generating automatic defenses and systematically evaluate said generated defenses against a comprehensive set of benchmarked attacks. Thus, we empirically demonstrated the improvement proposed by our approach in mitigating goal-hijacking vulnerabilities in LLMs. Our work recognizes the increasing relevance of small open-sourced LLMs and their potential for broad deployments on edge devices, aligning with future trends in LLM applications. We contribute to the greater ecosystem of open-source LLMs and their security in the following: (1) assessing present prompt-based defenses against the latest attacks, (2) introducing a new framework using a seed defense (Chain Of Thoughts) to refine the defense prompts iteratively, and (3) showing significant improvements in detecting goal hijacking attacks. Out strategies significantly reduce the success rates of the attacks and false detection rates while at the same time effectively detecting goal-hijacking capabilities, paving the way for more secure and efficient deployments of small and open-source LLMs in resource-constrained environments.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Defense prompt generation workflow
  • Figure 2: Metrics for Different Attack Types
  • Figure 3: Defense Evaluation Scores by Temperature
  • Figure 4: Defense prompt generation workflow