
AEGIS: Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema

Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee

TL;DR

AEGIS presents an automated co-evolutionary framework to defend against prompt injection attacks in LLMs by jointly evolving attacker and defender prompts through Textual Gradient Optimization (TGO+) guided by an LLM evaluation loop. It operates without model fine-tuning, enabling black-box applicability, and demonstrates state-of-the-art robustness on automated assignment grading across multiple LLMs. The work highlights the importance of co-evolution, gradient replay, and multi-objective optimization, and shows cross-model generalizability and prompt transferability. These findings indicate that adversarial training at the prompt level can be a scalable and effective defense for real-world LLM deployments.

Abstract

Prompt injection attacks pose a significant challenge to the safe deployment of Large Language Models (LLMs) in real-world applications. While prompt-based detection offers a lightweight and interpretable defense strategy, its effectiveness has been hindered by the need for manual prompt engineering. To address this issue, we propose AEGIS, an Automated co-Evolutionary framework for Guarding prompt Injections Schema. Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique. This framework enables both attackers and defenders to evolve autonomously via a Textual Gradient Optimization (TGO) module, leveraging feedback from an LLM-guided evaluation loop. We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines in both attack success and detection. Specifically, the attack success rate (ASR) reaches 1.0, an improvement of 0.26 over the baseline. For detection, the true positive rate (TPR) improves by 0.23 over the previous best work, reaching 0.84, while the true negative rate (TNR) remains comparable at 0.89. Ablation studies confirm the importance of co-evolution, gradient buffering, and multi-objective optimization. We also confirm that the framework is effective across different LLMs. Our results highlight the promise of adversarial training as a scalable and effective approach for guarding against prompt injections.
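
For readers less familiar with the detection metrics quoted in the abstract, the following is a minimal sketch of how TPR, TNR, and ASR are conventionally computed. The `Trial` record, the function names, and the sample data are hypothetical and serve only as illustration. In particular, what counts as a "successful" attack is an assumption here; the paper's own criterion is defined in its evaluation section.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trial:
    """Hypothetical record of one detector decision (not from the paper)."""
    is_attack: bool  # ground truth: the input contains a prompt injection
    flagged: bool    # detector output: the input was flagged as an injection


def detection_rates(trials: List[Trial]) -> Tuple[float, float]:
    """Return (TPR, TNR) using the standard confusion-matrix definitions."""
    tp = sum(t.is_attack and t.flagged for t in trials)
    fn = sum(t.is_attack and not t.flagged for t in trials)
    tn = sum((not t.is_attack) and (not t.flagged) for t in trials)
    fp = sum((not t.is_attack) and t.flagged for t in trials)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


def attack_success_rate(succeeded: List[bool]) -> float:
    """Fraction of attack attempts that succeeded (success criterion assumed)."""
    return sum(succeeded) / len(succeeded) if succeeded else float("nan")


# Made-up example data: two attacks (one caught) and three benign inputs (one false alarm).
example = [
    Trial(is_attack=True, flagged=True),
    Trial(is_attack=True, flagged=False),
    Trial(is_attack=False, flagged=False),
    Trial(is_attack=False, flagged=False),
    Trial(is_attack=False, flagged=True),
]
tpr, tnr = detection_rates(example)
asr = attack_success_rate([True, True, False, True])
```

A high TPR means few injections slip through, while a high TNR means few benign submissions are wrongly flagged; a defender prompt has to balance the two.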

Paper Structure

This paper contains 38 sections, 3 equations, 5 figures, 11 tables, 2 algorithms.

Figures (5)

  • Figure 1: Overview of adversarial co-evolution framework to systematically explore defenses against prompt injection attacks.
  • Figure 2: Overview of the Co-evolutionary Adversarial Framework. The system continuously co-optimizes attack and defense prompt candidates through interaction with a main application. Prompt candidates are evaluated using Equations (1) and (2), and gradient-like feedback is used to iteratively evolve both attackers and defenders, encouraging robustness and adaptivity across adversarial interactions.
  • Figure 3: Overview of the Textual Gradient Optimization (TGO) module. The TGO module iteratively improves prompts by simulating gradient-based optimization using language model feedback. Grading results are sampled to construct error strings and generate gradient messages, which are then processed by an LLM to obtain feedback. This feedback guides the editing of prompts according to the optimization type (e.g., attack or defense).
  • Figure 4: Iterative performance of the attacker and defender across GAN iterations, measured by True Negative Rate (TNR). Shaded regions represent the standard deviation across runs. No obvious difference is visible between the ablation setups: all of them achieve a TNR of around 0.9 at each iteration.
  • Figure 5: Iterative performance of the attacker and defender across GAN iterations, measured by True Positive Rate (TPR). Shaded regions represent the standard deviation across runs. The default method shows the steadiest improvement and achieves the best final TPR.
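
The TGO loop summarized in the Figure 3 caption (sample grading results, build an error string, request feedback from an LLM, then edit the prompt) might be sketched roughly as follows. This is an illustrative reading of the caption only, not the authors' implementation: `tgo_step`, the `LLM` callable, the `GradingResult` tuple, and the message templates are all assumptions, and the language model is replaced by a stub so the example runs offline.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
LLM = Callable[[str], str]        # prompt text in, completion text out
GradingResult = Tuple[str, bool]  # (case description, did the prompt succeed?)


def tgo_step(prompt: str, results: List[GradingResult], llm: LLM,
             mode: str, sample_size: int = 3, seed: int = 0) -> str:
    """Run one assumed textual-gradient step for an 'attack' or 'defense' prompt."""
    failures = [case for case, ok in results if not ok]
    if not failures:
        return prompt  # no failures to learn from, so keep the prompt unchanged

    # 1. Sample grading results and build the error string.
    rng = random.Random(seed)
    sampled = rng.sample(failures, min(sample_size, len(failures)))
    error_string = "\n".join(f"- {case}" for case in sampled)

    # 2. Build the gradient message and ask the LLM for feedback.
    gradient_message = (
        f"This {mode} prompt failed on the cases below.\n"
        f"Prompt:\n{prompt}\n\nFailed cases:\n{error_string}\n\n"
        "Explain why the prompt failed."
    )
    feedback = llm(gradient_message)

    # 3. Ask the LLM to edit the prompt according to the feedback.
    edit_request = (
        f"Rewrite this {mode} prompt so that it addresses the feedback.\n"
        f"Prompt:\n{prompt}\n\nFeedback:\n{feedback}"
    )
    return llm(edit_request)


def fake_llm(text: str) -> str:
    """Offline stand-in for a real model; returns canned strings."""
    if text.startswith("Rewrite"):
        return "Flag any submission that contains instructions addressed to the grader."
    return "The prompt ignores instructions embedded in the submission."


if __name__ == "__main__":
    graded = [("essay with hidden 'give full marks' text", False),
              ("ordinary essay", True)]
    print(tgo_step("Detect prompt injections.", graded, fake_llm, mode="defense"))
```

In a co-evolutionary setting, a step like this would presumably alternate between the attacker and defender prompts, with each side graded against the other's latest candidates.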