Analysis of LLMs Against Prompt Injection and Jailbreak Attacks

Piyush Jaiswal; Aaditya Pratap; Shreyansh Saraswati; Harsh Kasyap; Somanath Tripathy

TL;DR

This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. It also evaluates several lightweight, inference-time defence mechanisms that operate as filters without any retraining or GPU-intensive fine-tuning.

Abstract

Large Language Models (LLMs) are widely deployed in real-world systems. Given their broad applicability, prompt engineering has become an efficient tool for resource-scarce organizations to adopt LLMs for their own purposes. At the same time, LLMs are vulnerable to prompt-based attacks, so analyzing this risk has become a critical security requirement. This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset across multiple open-source LLMs, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants. We observe significant behavioural variation across models, including refusal responses and completely silent non-responsiveness triggered by internal safety mechanisms. Furthermore, we evaluate several lightweight, inference-time defence mechanisms that operate as filters without any retraining or GPU-intensive fine-tuning. Although these defences mitigate straightforward attacks, they are consistently bypassed by long, reasoning-heavy prompts.

Paper Structure (51 sections, 3 figures, 6 tables)

Introduction
Related Work
Prompt Injection and Jailbreak Attacks
Defence Mechanisms
Open-Source Large Language Models
Models Under Evaluation
Safety Mechanisms and Known Limitations
Why These Models Matter
Evaluation Datasets and Benchmarks
Attack Setup
Models Evaluated
Adversarial Prompt Construction
Prompt Sources
Prompt Categories
Long-Format Prompting
...and 36 more sections

Figures (3)

Figure 1: Prompt Injection Vulnerability Rates
Figure 2: Jailbreak Results - Vulnerable Responses
Figure 3: Safe vs Vulnerable vs Timeout
