Securing Large Language Models (LLMs) from Prompt Injection Attacks
Omar Farooq Khan Suri, John McCrae
TL;DR
This work probes whether JATMO-style fine-tuning of non–instruction-tuned LLMs trades some instruction-following susceptibility for improved resilience against prompt injection. By adapting the HOUYI genetic-attack framework and testing on LLaMA 2 and Qwen models alongside a GPT-3.5-Turbo baseline, it shows that JATMO substantially reduces, but does not eliminate, injection success, with a notable trade-off where higher-quality generations are more prone to adversarial prompts. The study provides a nuanced view of the strengths and limits of fine-tuning-based defenses and argues for layered, system-level mitigations to achieve robust security in real-world deployments. Overall, the results motivate combining task-focused fine-tuning with detection, validation, and constrained-generation strategies to mitigate prompt-injection risks more comprehensively.
Abstract
Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks. These attacks leverage the model's instruction-following ability to make it perform malicious tasks. Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function, thereby reducing susceptibility to adversarial instructions. In this study, we evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimizes adversarial prompts. We adapt HOUYI by introducing custom fitness scoring, modified mutation logic, and a new harness for local model testing, enabling a more accurate assessment of defense effectiveness. We fine-tuned LLaMA 2-7B, Qwen1.5-4B, and Qwen1.5-0.5B models under the JATMO methodology and compared them with a fine-tuned GPT-3.5-Turbo baseline. Results show that while JATMO reduces attack success rates relative to instruction-tuned models, it does not fully prevent injections; adversaries exploiting multilingual cues or code-related disruptors still bypass defenses. We also observe a trade-off between generation quality and injection vulnerability, suggesting that better task performance often correlates with increased susceptibility. Our results highlight both the promise and limitations of fine-tuning-based defenses and point toward the need for layered, adversarially informed mitigation strategies.
