
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Siyuan Li, Zehao Liu, Xi Lin, Qinghua Mao, Yuliang Chen, Haoyu Li, Jun Wu, Jianhua Li, Xiu Su

Abstract

As Large Language Models (LLMs) are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially attacks that evolve over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine their strategies across rounds. In this work, we propose CoopGuard, a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (a Deferring Agent, a Tempting Agent, and a Forensic Agent) that implement complementary round-level strategies, coordinated by a System Agent that conditions its decisions on the evolving defense state (the interaction history) and orchestrates the agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively evolving multi-round attacks on LLMs. Experiments show that CoopGuard reduces the attack success rate by 78.9% relative to state-of-the-art defenses, while improving the deceptive rate by 186% and reducing attack efficiency by 167.9%, offering a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.
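To make the orchestration idea concrete, the following is a minimal sketch of a stateful System Agent that routes each incoming round to one of the three strategies based on an evolving defense state. All names (`DefenseState`, `system_agent`, the keyword heuristic, and the suspicion thresholds) are illustrative assumptions, not the paper's actual implementation; the paper's agents would presumably use LLM-based judgments rather than a keyword score.

```python
from dataclasses import dataclass, field

@dataclass
class DefenseState:
    """Evolving defense state: interaction history plus a running suspicion score.

    This structure is a hypothetical stand-in for the paper's stateful
    defense state (Definition 1); field names are our own.
    """
    history: list = field(default_factory=list)  # (prompt, chosen strategy) per round
    suspicion: float = 0.0                       # cumulative risk estimate across rounds

# Toy heuristic markers; a real deployment would use a learned or LLM-based scorer.
SUSPICIOUS_MARKERS = ("ignore previous", "jailbreak", "pretend you are")

def score_round(prompt: str) -> float:
    """Per-round risk signal: count of suspicious markers in the prompt."""
    lowered = prompt.lower()
    return float(sum(marker in lowered for marker in SUSPICIOUS_MARKERS))

def system_agent(state: DefenseState, prompt: str) -> str:
    """Pick a round-level strategy conditioned on the evolving defense state.

    Escalates from normal answering to the Deferring Agent (ambiguity
    injection), the Tempting Agent (deceptive traps), and finally the
    Forensic Agent (evidence collection) as suspicion accumulates.
    The thresholds here are arbitrary illustrative values.
    """
    state.suspicion += score_round(prompt)
    if state.suspicion == 0:
        strategy = "answer"      # benign so far: respond normally
    elif state.suspicion < 2:
        strategy = "defer"       # Deferring Agent: slow probing via ambiguity
    elif state.suspicion < 4:
        strategy = "tempt"       # Tempting Agent: mislead with a deceptive trap
    else:
        strategy = "forensic"    # Forensic Agent: log and analyze attack evidence
    state.history.append((prompt, strategy))
    return strategy
```

Because the state persists across rounds, the same prompt can trigger a stronger response later in the dialogue than it would in round one, which is the behavior a stateless, per-message filter cannot express.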

Paper Structure

This paper contains 32 sections, 4 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the challenge posed by independent yet progressively evolving multi-round adversarial attacks on LLMs and our innovative CoopGuard multi-agent adaptive defense mechanism to effectively counter these evolving threats.
  • Figure 2: Overview of the CoopGuard multi-agent jailbreak defense framework. The Deferring Agent slows down probing progress via ambiguity injection, while the Tempting Agent generates deceptive traps to mislead attackers. The Forensic Agent collects and analyzes evidence of attack behaviors. The System Agent oversees the agents, dynamically refining defense strategies to adapt to evolving threats. This cooperative process safeguards the system, depletes the attacker's resources, and collects intelligence on attack behavior.
  • Figure 3: Resource footprint of multi-round attacks. (a) Token consumption for defense and attack across question types. (b) Token consumption for defense and attack across jailbreak strategies.
  • Figure 4: Characteristics of jailbreak strategies. (a) Distribution of prompt lengths for each strategy type. (b) Attack-to-defense token ratio for each strategy type, indicating the attacker’s relative cost.
  • Figure 5: Evaluation of AE, quantified by the average token consumption per dialogue across different models. CoopGuard forces the attacker to expend significantly more resources (higher is better for defense) compared to baselines.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1: Stateful Defense State
  • Definition 2: Structured Agent Setting