Multi-Agent Reinforcement Learning in Cybersecurity: From Fundamentals to Applications

Christoph R. Landolt, Christoph Würsch, Roland Meier, Alain Mermoud, Julian Jang-Jaccard

TL;DR

This paper surveys the use of Multi-Agent Reinforcement Learning (MARL) for automated cyber defense, emphasizing decentralized coordination, adversarial training, and dynamic environments. It reviews foundational models (e.g., Dec-POMDPs, POSGs), training paradigms (cooperative, competitive, mixed), and key MARL algorithms (MADDPG, MAPPO, IPPO) in the context of cybersecurity. It also highlights Cyber Gyms and AICA as essential ecosystems for training, validating, and deploying MARL-driven defenses, while outlining challenges such as scalability, non-stationarity, and the simulation-to-reality gap. The work argues that MARL can significantly enhance intrusion detection, red-blue team interactions, and lateral-movement containment, provided advances in realistic environments and robust, scalable training regimes are achieved.

Abstract

Multi-Agent Reinforcement Learning (MARL) has shown great potential as an adaptive solution for addressing modern cybersecurity challenges. MARL enables decentralized, adaptive, and collaborative defense strategies and provides an automated mechanism to combat dynamic, coordinated, and sophisticated threats. This survey investigates the current state of research in MARL applications for automated cyber defense (ACD), focusing on intrusion detection and lateral movement containment. Additionally, it examines the role of Autonomous Intelligent Cyber-defense Agents (AICA) and Cyber Gyms in training and validating MARL agents. Finally, the paper outlines existing challenges, such as scalability and adversarial robustness, and proposes future research directions. The survey also discusses how MARL integrates into AICA to provide adaptive, scalable, and dynamic solutions to counter the increasingly sophisticated landscape of cyber threats. It highlights the transformative potential of MARL in areas like intrusion detection and lateral movement containment, and underscores the value of Cyber Gyms for training and validation of AICA.

Paper Structure

This paper contains 21 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Reinforcement Learning (RL) control loop, inspired by Sutton et al. [Sutton2005ReinforcementLA]. The figure illustrates the interaction between an RL-Agent and its environment over discrete time steps. At each step $t$, the agent observes the current state $s_t$ and reward $r_t$ from the environment, selects an action $a_t$, and sends it to the environment. The environment then transitions to a new state $s_{t+1}$ and provides a new reward $r_{t+1}$, completing the loop.
  • Figure 2: Nested structure of game models. The Partially Observable Stochastic Game includes $n$ agents and $m$ partially observable states. A simpler model is the Stochastic Game, which assumes complete observability of the states, while the Repeated Normal-Form Game involves $n$ agents interacting in a single state. Adapted from [marl-book].
  • Figure 3: Interaction between multiple agents, an optional communication network, and the environment. Each agent can have its own goals, strategies, and beliefs. The agent receives individual observations and rewards from the environment. The actions of all agents are combined into a joint action that influences the state of the environment. This joint action introduces complexity in determining the causal relationship between an individual agent's actions and the resulting rewards and observations.
  • Figure 7: Centralized training and decentralized execution framework in MADDPG. During training, each agent is associated with its own centralized Q-function ($Q_1$ to $Q_N$), which uses the global state and joint actions of all agents to optimize its policy ($\pi_1$ to $\pi_N$). At execution, each agent acts based on its decentralized policy derived during training, relying only on local observations ($o_1$ to $o_N$) and actions ($a_1$ to $a_N$). This structure effectively allows agents to adapt to competitive, collaborative, and mixed-interest dynamics.
  • Figure 8: Comparison of SARL and MARL for Intrusion Detection System (IDS) architectures. In the SARL approach, sensors are deployed at key points (e.g., firewalls, web application firewalls, and critical servers) to collect and analyze network data centrally at an IDS-Management Station managed by a single RL-Agent. In contrast, the MARL approach deploys multiple RL-Agents that collaborate and share information across the network, enhancing decentralized detection of malicious activities.
  • ...and 3 more figures
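
The interaction pattern described in the captions for Figures 1 and 3 (individual observations and rewards, with all actions combined into one joint action) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical toy and not taken from the paper: `ToyNetworkEnv`, `local_policy`, `run_episode`, and the reward values are assumptions chosen only to make the loop concrete.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyNetworkEnv:
    """Hypothetical toy environment: one host per agent, exactly one compromised."""
    n_agents: int = 2
    seed: int = 0

    def reset(self):
        self.rng = random.Random(self.seed)
        self.compromised = self.rng.randrange(self.n_agents)
        return self._observe()

    def _observe(self):
        # Partial observability: agent i only sees the status of its own host.
        return [int(i == self.compromised) for i in range(self.n_agents)]

    def step(self, joint_action):
        # joint_action[i] == 1 means agent i isolates its host (assumed action set).
        rewards = []
        for i, action in enumerate(joint_action):
            if action == 1 and i == self.compromised:
                rewards.append(1.0)    # correct containment
            elif action == 1:
                rewards.append(-0.5)   # unnecessary isolation
            else:
                rewards.append(0.0)
        # Transition: the compromise moves to a random host (toy lateral movement).
        self.compromised = self.rng.randrange(self.n_agents)
        return self._observe(), rewards


def local_policy(observation):
    # Decentralized policy: isolate only if the local host looks compromised.
    return observation


def run_episode(env, policies, steps=5):
    observations = env.reset()
    totals = [0.0] * env.n_agents
    for _ in range(steps):
        # Each agent acts on its own observation; the list is the joint action.
        joint_action = [pi(o) for pi, o in zip(policies, observations)]
        observations, rewards = env.step(joint_action)
        totals = [t + r for t, r in zip(totals, rewards)]
    return totals


if __name__ == "__main__":
    env = ToyNetworkEnv(n_agents=3)
    print(run_episode(env, [local_policy] * 3, steps=5))
```

In this toy setting each agent sees its own host perfectly, so credit assignment is trivial. The difficulty noted in the Figure 3 caption appears once rewards depend on what other agents do or observations become noisy.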