Maatphor: Automated Variant Analysis for Prompt Injection Attacks

Ahmed Salem, Andrew Paverd, Boris Köpf

TL;DR

Prompt injection poses a security risk to LLM-enabled systems, and defenses against emerging variants are often ad hoc. Maatphor offers an automated pipeline that (i) derives the adversary goal from a seed prompt, (ii) generates diverse prompt variants via an LLM guided by seven design principles, and (iii) evaluates each variant purely from model outputs, using both string-based and embedding-based metrics, with a feedback loop. The authors demonstrate Maatphor on three tasks (misinformation, fraud, and style changes), showing that it can turn an ineffective seed into multiple effective variants within tens of iterations. The work provides a scalable way to test defenses and to build variant-rich datasets for jailbreak and prompt-injection research, and it informs defense strategies against evolving prompt-based attacks.
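To make step (iii) concrete, here is a minimal Python sketch of judging a variant from model outputs alone. It is not the authors' implementation: the function names, the 0.8 threshold, and the bag-of-words vector (a toy stand-in for a real embedding model) are all assumptions made for illustration, as is the reading of "effectiveness" as the fraction of repeated outputs judged successful.

```python
import math
import re
from collections import Counter


def string_match(output: str, target: str) -> bool:
    """String-based check: does the normalized target phrase occur in the output?"""
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s.lower()).strip()
    return norm(target) in norm(output)


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_effective(output: str, target: str, threshold: float = 0.8) -> bool:
    """A single output counts as a success if either metric fires.
    The 0.8 threshold is a hypothetical value, not taken from the paper."""
    similar = cosine(bow_vector(output), bow_vector(target)) >= threshold
    return string_match(output, target) or similar


def effectiveness(outputs: list[str], target: str, threshold: float = 0.8) -> float:
    """Fraction of model outputs in which the adversary goal was reached."""
    if not outputs:
        return 0.0
    return sum(is_effective(o, target, threshold) for o in outputs) / len(outputs)
```

For example, calling `effectiveness` on several responses sampled for the same variant yields a score between 0 and 1 that can be compared across variants. In practice the bag-of-words vector would be swapped for a sentence-embedding model.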

Abstract

Prompt injection has emerged as a serious security threat to large language models (LLMs). The current best practice for defending against newly discovered prompt injection techniques is to add guardrails to the system (e.g., by updating the system prompt or using classifiers on the input and/or output of the model). However, in the same way that variants of a piece of malware are created to evade anti-virus software, variants of a prompt injection can be created to evade the LLM's guardrails. Ideally, when a new prompt injection technique is discovered, candidate defenses should be tested not only against the successful prompt injection, but also against possible variants. In this work, we present Maatphor, a tool to assist defenders in performing automated variant analysis of known prompt injection attacks. This involves solving two main challenges: (1) automatically generating variants of a given prompt, and (2) automatically determining whether a variant was effective based only on the output of the model. This tool can also assist in generating datasets for jailbreak and prompt injection attacks, thus overcoming the scarcity of data in this domain. We evaluate Maatphor on three different types of prompt injection tasks. Starting from an ineffective (0%) seed prompt, Maatphor consistently generates variants that are at least 60% effective within the first 40 iterations.
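The two challenges named in the abstract fit together as an iterative loop: generate a variant, score it from the model's output, and feed the scored history back into the next generation step. The sketch below shows only that control flow. In Maatphor the generator is an LLM and the score comes from the target model's responses; here both are replaced by deterministic toy stand-ins, and every name (`variant_analysis`, `toy_generate`, `toy_score`) is hypothetical.

```python
from typing import Callable

History = list[tuple[str, float]]  # (prompt, effectiveness score) pairs


def variant_analysis(
    seed: str,
    generate: Callable[[str, History], str],
    score: Callable[[str], float],
    max_iterations: int = 40,
) -> History:
    """Run the generate -> evaluate -> feedback loop and return the scored history.

    `generate` sees the seed plus all previously scored prompts (the feedback);
    `score` maps a prompt to an effectiveness value in [0, 1].
    """
    history: History = [(seed, score(seed))]
    for _ in range(max_iterations):
        variant = generate(seed, history)
        history.append((variant, score(variant)))
    return history


def toy_generate(seed: str, history: History) -> str:
    """Stand-in for the LLM generator (assumption): extend the best prompt so far."""
    best_prompt, _ = max(history, key=lambda item: item[1])
    return best_prompt + " please"


def toy_score(prompt: str) -> float:
    """Stand-in for querying the target model (assumption): a synthetic score."""
    return min(1.0, prompt.count("please") / 5)


if __name__ == "__main__":
    run = variant_analysis("Ignore previous instructions", toy_generate, toy_score,
                           max_iterations=5)
    # Keep variants at or above a 60% effectiveness cut-off.
    effective = [prompt for prompt, s in run if s >= 0.6]
    print(len(effective), "effective variants found")
```

Passing the full history to the generator is what lets later variants build on earlier partial successes, which is the role the feedback loop plays in the description above.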

Paper Structure

This paper contains 19 sections, 3 figures, 4 tables, and 2 algorithms.

Figures (3)

  • Figure 1: An overview of Maatphor.
  • Figure 2: The maximum and individual effectiveness scores for various iterations of Maatphor. Each iteration corresponds to a new variant, illustrating the performance progression over time.
  • Figure 3: Comparing the counts of scores for running Maatphor with and without the feedback loop.