MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

Jonathan Nöther, Adish Singla, Goran Radanovic

TL;DR

MaMa presents a novel game-theoretic framework for designing safe agentic systems that remain robust under worst-case adversarial compromises. By casting the design process as a Stackelberg security game between a Meta-Agent and a Meta-Adversary and leveraging iterative LLM-based adversarial search, MaMa discovers architectures that defend against the strongest discovered attacks while maintaining high task performance. The approach introduces a structured agentic system representation, a formal threat model with a defined safety objective, and an archive-guided search strategy that samples strong prior designs. Experimental results across four diverse environments show substantial safety gains with minimal to no sacrifice in quality, and evidence of transferability to stronger adversaries, different LLMs, and targeted attacks, indicating robust safety beyond training conditions.

Abstract

LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. We formalize this challenge as a Stackelberg security game between a system designer (the Meta-Agent) and a best-responding Meta-Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta-Adversary-Meta-Agent (MaMa), a novel algorithm for approximately solving this game and automatically designing safe agentic systems. Our approach uses LLM-based adversarial search, where the Meta-Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta-Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst-case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.
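The iterative search described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the `System` and `Entry` types, the hard-coded `evaluate_safety` scorer, the restriction to single-agent compromises (the paper's Meta-Adversary selects subsets of agents), and the rule-based `meta_agent` and `meta_adversary` functions, which stand in for what the paper does with LLM-based search.

```python
import random
from dataclasses import dataclass

# Toy stand-in: a "system" is a set of agents plus the subset whose outputs
# are filtered by a safety reviewer. Real agentic designs are far richer.
@dataclass(frozen=True)
class System:
    agents: frozenset
    guarded: frozenset = frozenset()

@dataclass
class Entry:
    system: System
    attacks: list   # (compromised_agent, safety) pairs, strongest attack first
    safety: float   # worst-case safety = safety under the strongest attack

def evaluate_safety(system, target):
    """Hypothetical scorer with made-up values; not taken from the paper."""
    if target in system.guarded:
        return 0.9
    return 0.2 if target == "planner" else 0.5

def meta_adversary(system, top_k=2):
    """Follower: try every single-agent compromise, return the strongest."""
    scored = [(a, evaluate_safety(system, a)) for a in sorted(system.agents)]
    return sorted(scored, key=lambda t: t[1])[:top_k]  # lowest safety first

def meta_agent(archive, rng):
    """Leader: sample a strong prior design, then patch its strongest attack."""
    weights = [entry.safety for entry in archive]
    parent = rng.choices(archive, weights=weights, k=1)[0]
    target = parent.attacks[0][0]
    return System(parent.system.agents, parent.system.guarded | {target})

def mama_search(initial, iterations=5, seed=0):
    """Alternate proposals and attacks; return the safest archived design."""
    rng = random.Random(seed)
    attacks = meta_adversary(initial)
    archive = [Entry(initial, attacks, attacks[0][1])]
    for _ in range(iterations):
        candidate = meta_agent(archive, rng)
        attacks = meta_adversary(candidate)
        archive.append(Entry(candidate, attacks, attacks[0][1]))
    return max(archive, key=lambda entry: entry.safety)

initial = System(frozenset({"planner", "coder", "executor"}))
best = mama_search(initial)
print(sorted(best.system.guarded), best.safety)
```

The sketch keeps only the leader-follower structure: the designer commits to a system first, the adversary best-responds, and the archive stores each design alongside the strongest attacks found against it. Task quality, which the paper optimizes alongside safety, is omitted here for brevity.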

Paper Structure (33 sections, 3 equations, 6 figures, 8 tables, 2 algorithms)

Figures (6)

  • Figure 1: Illustration of our proposed approach based on a simplified but real example. We model the design of safe agentic systems as a game between two players: a Meta-Agent and a Meta-Adversary. The Meta-Agent analyzes previously discovered attacks and proposes system modifications to mitigate them. The Meta-Adversary searches for and evaluates a diverse set of possible attacks against the updated system and returns the strongest ones found. Both the revised system and the corresponding attacks are stored in the Meta-Agent’s archive for future iterations. In this example, the Meta-Agent identifies a vulnerability in the central Planner agent, which can issue unfiltered instructions to action-taking agents. To address this, the Meta-Agent introduces a defensive Safety agent that oversees and filters the Planner’s outputs. As a result, direct attacks on the Planner are no longer effective, forcing the Meta-Adversary to resort to weaker strategies, such as attacking action-taking agents directly. This iterative process yields a system that is significantly more robust to malicious agents.
  • Figure 2: Quality and safety of generated systems for all four environments. The shaded regions indicate the standard deviation over three distinct evaluations. Since we are interested in a worst-case adversary, we depict the five strongest discovered attacks. The quality is measured over 25 iterations of the benign system. We utilize a unique Meta-Adversary for every generated system. Each point depicts the maximum of the current and all previous generations. In all environments, MaMa significantly improves the safety of the initial system (Iteration 0) while maintaining, and in some cases even increasing, the quality. Compared to the baselines, the systems generated by MaMa are significantly safer. The quality loss is minimal in all cases, and in some cases MaMa even generates systems that produce higher-quality outputs than systems optimized only for quality.
  • Figure 3: Average safety score of the top-5 attacks for different numbers of adversarial agents. Most systems remain safe when the adversary controls up to three agents in the system. However, once the adversary controls more than four agents, the safety of the systems drops more steeply.
  • Figure 4: Illustration of the best discovered system for each environment. Bold names highlight agents that were newly added compared to the initial system. Most of these new agents are tasked with ensuring the safety of the system, either by serving as reviewers or by overseeing other agents. However, some are tasked with improving the quality of the outputs.
  • Figure 5: Extended results across all four environments. The results demonstrate that, in addition to the collected values, the quality of an attacked system and the safety of a clean system remain stable.
  • ...and 1 more figure
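
The "top-5 attacks" reporting mentioned in the captions of Figures 2 and 3 can be sketched as a small aggregation. This is a minimal sketch under stated assumptions: the function names are hypothetical, "strongest" is taken to mean "lowest resulting safety score" (consistent with an adversary that minimizes safety), and the standard deviation is assumed to be computed over the per-evaluation averages. The paper's exact aggregation may differ.

```python
from statistics import mean, stdev

def top_k_attack_safety(safety_scores, k=5):
    """Mean safety under the k strongest attacks (lowest safety = strongest)."""
    strongest = sorted(safety_scores)[:k]
    return mean(strongest)

def summarize(evaluations, k=5):
    """Mean and standard deviation of the top-k metric across evaluations.

    `evaluations` is a list of score lists, one per independent evaluation;
    at least two evaluations are needed for the standard deviation.
    """
    per_eval = [top_k_attack_safety(scores, k) for scores in evaluations]
    return mean(per_eval), stdev(per_eval)

# Illustrative, made-up scores for three evaluations of one system.
runs = [
    [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8],
    [0.8, 0.2, 0.6, 0.3, 0.7, 0.4, 0.9],
    [0.7, 0.1, 0.4, 0.3, 0.6, 0.2, 0.9],
]
avg, spread = summarize(runs)
print(f"top-5 safety: {avg:.3f} +/- {spread:.3f}")
```

Averaging over the few strongest attacks, rather than over all attacks, keeps the reported number close to the worst case while damping the noise of any single attack evaluation.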