Prompt Injection Attacks in Defended Systems
Daniil Khomsky, Narek Maloyan, Bulat Nutfullin
TL;DR
This paper examines prompt-injection vulnerabilities in defended LLM systems through the SaTML-2024 CTF framework. It analyzes black-box attacks under a three-tier defense (system prompt, Python filter, LLM filter) and evaluates defenses against a suite of basic and combined attacks, using metrics where score_D = (P_D + b_D) * v_D with P_D = max(1050 - 50 X, 0) and v_D = 0.85^n to reward early wins. The study tests defenses on ChatGPT-3.5 and Llama-2, highlighting residual weaknesses and the need for proactive, automated defenses that anticipate evolving attack methods. Overall, the work provides a structured methodology and practical insights for securing LLM deployments against prompt-injection and jailbreak-style threats.
Abstract
Large language models play a crucial role in modern natural language processing technologies. However, their extensive use also introduces potential security risks, such as the possibility of black-box attacks. These attacks can embed hidden malicious features into the model, leading to adverse consequences during its deployment. This paper investigates methods for black-box attacks on large language models with a three-tiered defense mechanism. It analyzes the challenges and significance of these attacks, highlighting their potential implications for language processing system security. Existing attack and defense methods are examined, evaluating their effectiveness and applicability across various scenarios. Special attention is given to the detection algorithm for black-box attacks, identifying hazardous vulnerabilities in language models and retrieving sensitive information. This research presents a methodology for vulnerability detection and the development of defensive strategies against black-box attacks on large language models.
