Explainable Autonomous Cyber Defense using Adversarial Multi-Agent Reinforcement Learning

Yiyao Zhang, Diksha Goel, Hussain Ahmad

Abstract

Autonomous agents are increasingly deployed in both offensive and defensive cyber operations, creating high-speed, closed-loop interactions in critical infrastructure environments. Advanced Persistent Threat (APT) actors exploit "Living off the Land" techniques and targeted telemetry perturbations to induce ambiguity in monitoring systems, causing automated defenses to overreact or misclassify benign behavior as malicious activity. Existing monolithic and multi-agent defense pipelines largely operate on correlation-based signals, lack structural constraints on response actions, and are vulnerable to reasoning drift under ambiguous or adversarial inputs. We present the Causal Multi-Agent Decision Framework (C-MADF), a structurally constrained architecture for autonomous cyber defense that integrates causal modeling with adversarial dual-policy control. C-MADF first learns a Structural Causal Model (SCM) from historical telemetry and compiles it into an investigation-level Directed Acyclic Graph (DAG) that defines admissible response transitions. This roadmap is formalized as a Markov Decision Process (MDP) whose action space is explicitly restricted to causally consistent transitions. Decision-making within this constrained space is performed by a dual-agent reinforcement learning system in which a threat-optimizing Blue-Team policy is counterbalanced by a conservatively shaped Red-Team policy. Inter-policy disagreement is quantified through a Policy Divergence Score and exposed via a human-in-the-loop interface equipped with an Explainability-Transparency Score that serves as an escalation signal under uncertainty. On the real-world CICIoT2023 dataset, C-MADF reduces the false-positive rate to 1.8%, compared with 11.2%, 9.7%, and 8.4% for three state-of-the-art baselines, while achieving 0.997 precision, 0.961 recall, and a 0.979 F1-score.
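To make the DAG-constrained action space concrete, the following minimal Python sketch masks an MDP's actions so that only causally admissible transitions remain. All identifiers (DAG_EDGES, ACTION_TARGET, admissible_actions) and the toy states and actions are illustrative assumptions, not the authors' implementation.

# Illustrative sketch, not the authors' code: masking an MDP's action space so
# that only transitions along edges of the learned causal DAG are admissible,
# mirroring C-MADF's MDP-DAG roadmap. States, actions, and edges are toy values.

# Learned investigation-level DAG, stored as directed edges (loosely Figure 3).
DAG_EDGES = {
    ("initial_alert", "evidence_collection"),
    ("evidence_collection", "causal_correlation"),
    ("causal_correlation", "threat_mitigated"),
}

# In this toy model each action deterministically targets one successor state.
ACTION_TARGET = {
    "collect": "evidence_collection",
    "correlate": "causal_correlation",
    "mitigate": "threat_mitigated",
}

def admissible_actions(state: str) -> list[str]:
    """Keep only actions whose induced transition (state, s') is a DAG edge,
    so that P(s' | s, a) > 0 implies (s, s') in E, as in Theorem 1's premise."""
    return [a for a, s_next in ACTION_TARGET.items() if (state, s_next) in DAG_EDGES]

# Jumping straight from the initial alert to mitigation is blocked:
assert admissible_actions("initial_alert") == ["collect"]
assert "mitigate" not in admissible_actions("evidence_collection")

Because every reachable transition is an edge of the learned DAG by construction, logically inconsistent investigation paths (e.g., Initial Alert directly to Threat Mitigated) are excluded at the action-space level rather than penalized after the fact.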

Paper Structure

This paper contains 47 sections, 5 theorems, 47 equations, 7 figures, 13 tables, and 4 algorithms.

Key Result

Theorem 1

Let $\mathcal{G}^\star=(V,E^\star)$ be the true causal DAG and $\mathcal{G}=(V,E)$ the learned DAG. Let $\mathcal{M}=(S,A,P)$ satisfy $P(s' \mid s,a)>0 \Rightarrow (s,s')\in E$. Let $\tau=(s_0,\dots,s_T)$ be the random trajectory generated under any admissible policy $\pi$. If $\Pr_\pi[(s_t,s_{t+1}) \notin E^\star] \le \rho$ for every $t \in \{0,\dots,T-1\}$, then $\Pr_\pi\!\left[(s_t,s_{t+1}) \in E^\star \text{ for all } t\right] \ge 1 - T\rho$. In particular, if $\rho=0$, then $(s_t,s_{t+1})\in E^\star$ holds almost surely for all $t$. $\blacksquare$
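Proof sketch (a minimal reconstruction, assuming the hypothesis as restated above): since $P(s' \mid s,a)>0 \Rightarrow (s,s')\in E$, every realized transition of $\tau$ lies in the learned edge set, so the only failure mode at step $t$ is traversing a spurious edge in $E \setminus E^\star$, an event of probability at most $\rho$ by hypothesis. A union bound over the $T$ steps gives $\Pr_\pi[(s_t,s_{t+1})\in E^\star \text{ for all } t] \ge 1 - T\rho$, and setting $\rho=0$ yields the almost-sure statement.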

Figures (7)

  • Figure 1: Illustration of a Shadow-Jitter telemetry manipulation scenario. Controlled perturbations in host and network logs distort apparent event correlations, leading an unconstrained correlation-based defense to misclassify benign activity as malicious. In contrast, C-MADF applies causal filtering and adversarial dual-policy validation within a constrained MDP-DAG structure, reducing spurious mitigation actions under ambiguous observations.
  • Figure 2: The architecture of the Causal Multi-Agent Decision Framework (C-MADF). The process begins with the Causal Discovery Module learning a causal model from data. This model informs the MDP-DAG Roadmap, which provides a verifiable structure for investigations. The Council of Rivals, consisting of a Blue-Team and a Red-Team agent, deliberates on the best course of action within this roadmap. Their debate and the resulting policy divergence are fed into the Explainable Human-in-the-Loop Interface, which computes the ETS and presents a clear recommendation to the supervisory adjudication.
  • Figure 3: The MDP-DAG Investigation Roadmap showing valid state transitions (orange arrows with reward values) and causally inconsistent blocked transitions (red X). The roadmap constrains the action space to ensure verifiable investigation flows, preventing logically inconsistent paths such as jumping directly from Initial Alert to Threat Mitigated without sufficient evidence collection.
  • Figure 4: Council of Rivals adversarial deliberation architecture. At each investigation state $s_t$ in the causally constrained MDP-DAG, the Blue-Team policy proposes a mitigation action based on its learned value estimates, while the Red-Team policy evaluates and challenges the proposal under a conservatively shaped objective. The interaction proceeds through iterative hypothesis evaluation, evidential justification, and action arbitration within the constrained action space. Inter-policy disagreement is quantified via the Policy Divergence Score $\mathcal{D}(s_t)$, which serves as an epistemic uncertainty signal and triggers human escalation when $\mathcal{D}(s_t) > \tau_{\text{div}}$. The policies are trained via self-play in a two-player stochastic game defined over the MDP-DAG, yielding stabilized adversarial decision policies. (An illustrative computation of $\mathcal{D}(s_t)$ is sketched after this list.)
  • Figure 5: Illustrative decomposition of the ETS into its three primary components, Clarity, Completeness, and Confidence, and their respective sub-metrics. The figure visualizes the aggregation structure used to compute ETS$(s_t)$ and demonstrates how the resulting score is compared against the escalation threshold $\tau_{\text{escalate}}$. In this example, ETS$(s_t)=0.82$ exceeds the calibrated threshold, and the system proceeds without mandatory human escalation. When ETS$(s_t)$ falls below the threshold, structured explanation artifacts and human oversight are triggered. (The same sketch after this list includes an illustrative ETS aggregation.)
  • ...and 2 more figures
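To make the two deliberation signals from Figures 4 and 5 concrete, the following minimal Python sketch computes a total-variation-style Policy Divergence Score over the admissible actions and a weighted ETS aggregation using the example value from Figure 5. The divergence measure, the equal component weights, and both threshold values are assumptions for illustration; the paper defines $\mathcal{D}(s_t)$, ETS$(s_t)$, $\tau_{\text{div}}$, and $\tau_{\text{escalate}}$ abstractly.

# Illustrative sketch, not the authors' implementation: a total-variation
# disagreement score between the Blue- and Red-Team action distributions,
# and a weighted ETS aggregation with assumed equal weights.

def policy_divergence(blue: dict[str, float], red: dict[str, float]) -> float:
    """Total-variation distance between the two policies' distributions
    over the DAG-admissible actions at the current investigation state."""
    actions = blue.keys() | red.keys()
    return 0.5 * sum(abs(blue.get(a, 0.0) - red.get(a, 0.0)) for a in actions)

def ets(clarity: float, completeness: float, confidence: float,
        weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Aggregate the three ETS components of Figure 5 (weights assumed)."""
    return sum(w * c for w, c in zip(weights, (clarity, completeness, confidence)))

TAU_DIV = 0.3        # assumed divergence threshold tau_div
TAU_ESCALATE = 0.75  # assumed escalation threshold tau_escalate

blue = {"mitigate": 0.7, "collect_more_evidence": 0.3}
red = {"mitigate": 0.2, "collect_more_evidence": 0.8}

d = policy_divergence(blue, red)  # 0.5 -> exceeds tau_div
score = ets(0.90, 0.80, 0.76)     # 0.82, as in the Figure 5 example
escalate = d > TAU_DIV or score < TAU_ESCALATE
print(f"D(s_t)={d:.2f}, ETS={score:.2f}, escalate={escalate}")

In this toy run the high policy divergence triggers escalation even though ETS$(s_t)=0.82$ clears the transparency threshold, matching the paper's description of $\mathcal{D}(s_t)$ as an independent epistemic uncertainty signal.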

Theorems & Definitions (5)

  • Theorem 1: Robust Causal Consistency
  • Proposition 1: Monotonic Reduction of False-Positive Mitigations under Gating
  • Theorem 2: TV Robustness of False-Positive Rate
  • Corollary 1: TV-Adjusted FP Bound for C-MADF
  • Theorem 3: PAC-Style Lower Bound for Rejected-FP Mass