AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, Han Liu

TL;DR

AdvEvo-MARL addresses safety challenges in LLM-based multi-agent systems by embedding safety into agents through adversarial co-evolution of attackers and defenders. It introduces a two-stage training pipeline: attacker warm-up via supervised fine-tuning followed by co-evolutionary MARL, aided by a public baseline for advantage estimation to stabilize learning. Across three attack scenarios and multiple task benchmarks, AdvEvo-MARL achieves ASR below 20% and often improves task performance (up to about 3–4%), without external guard overhead, demonstrating that safety and utility can be co-optimized in a unified framework. The approach shows promise for building robust, safe, and capable MAS by leveraging evolving threats to cultivate generalized defensive strategies.

Abstract

LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single point of failure: once the guard is compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving (and sometimes improving) task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
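
The public baseline described in the abstract can be illustrated with a minimal sketch. It assumes each agent receives a single scalar return per episode and that an agent's advantage is simply its own return minus the mean return of its functional group; the paper's actual estimator may include further terms (for example normalization or clipping) that are not stated here. The function name `public_baseline_advantages`, the agent identifiers, and the return values are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def public_baseline_advantages(returns_by_group):
    """Compute per-agent advantages with a shared group-level baseline.

    returns_by_group maps a group name (e.g. "attackers") to a dict of
    {agent_id: scalar_return}. Assumption: advantage = own return minus
    the mean return of the agent's own group.
    """
    advantages = {}
    for group, returns in returns_by_group.items():
        # The "public" baseline: one mean return shared by the whole group.
        baseline = fmean(returns.values())
        advantages[group] = {
            agent: ret - baseline for agent, ret in returns.items()
        }
    return advantages


# Hypothetical returns for one episode (illustrative values only).
example_returns = {
    "attackers": {"atk_0": 1.0, "atk_1": 0.0},
    "defenders": {"def_0": 0.5, "def_1": 1.0, "def_2": 0.0},
}

adv = public_baseline_advantages(example_returns)
for group, group_adv in adv.items():
    print(group, group_adv)
```

Under this assumption the advantages within each group sum to zero, so an agent is credited only for doing better than its group's average, which is one way to read the abstract's claim about lower-variance updates.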

Paper Structure

This paper contains 18 sections, 6 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Framework. AdvEvo-MARL begins by warming up attacker agents through supervised fine-tuning to embed prior knowledge of jailbreak behaviors. Then, attackers and defenders learn to co-evolve via adversarial multi-agent reinforcement learning. During policy updates, agents within the same functional group (i.e., attackers or defenders) leverage a public baseline which is computed as the mean return of their respective group to estimate their individual advantages for optimization.
  • Figure 2: Task benchmark performance. AdvEvo-MARL exhibits minimal degradation and, in some cases, improved results.
  • Figure 3: Performance variations under different training configurations. Left: robustness performance; AdvEvo-MARL consistently maintains the lowest ASR. Right: task performance; AdvEvo-MARL improves task utility across all settings, reaching a maximum gain of 4% on LiveCodeBench.
  • Figure 4: Diversity of attacker-generated prompts.
  • Figure 5: Performance comparison of AdvEvo-MARL trained with the public baseline for advantage estimation (Baseline) and the variant trained without it (No Baseline).