An Empirical Game-Theoretic Analysis of Autonomous Cyber-Defence Agents

Gregory Palmer, Luke Swaby, Daniel J. B. Harrold, Matthew Stewart, Alex Hiles, Chris Willis, Ian Miles, Sara Farmer

TL;DR

This work addresses the challenge of learning robust autonomous cyber-defence policies in adversarial, high-dimensional settings by framing ACD as a partially observable Markov game and applying an empirical game-theoretic analysis with a principled double oracle (DO) backbone. It extends DO with multiple response oracles (MRO) and introduces value-function based potential-based reward shaping (VF-PBRS) along with pre-trained model sampling (PTMs) to accelerate convergence and improve policy robustness. Through empirical studies on CybORG CAGE CC2 and CC4 environments, the authors demonstrate that VF-PBRS and PTMs can yield stronger, more generalisable Blue policies and that MRO preserves convergence guarantees while enabling richer policy mixtures. The findings underscore the importance of adversarially evaluating ACD approaches against diverse, worst-case attackers and highlight practical considerations for deployment, including computation, ensemble design, and ethical implications of adversarial learning in high-fidelity cyber environments.
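As a minimal sketch of what potential-based reward shaping does, the snippet below adds the standard shaping term $F(s, s') = \gamma \Phi(s') - \Phi(s)$ to each reward and checks that the shaped and unshaped discounted returns differ only by a telescoped potential term. The state names, rewards and `value_table` are hypothetical stand-ins for a pre-trained value function; they are not taken from the paper or from CybORG.

```python
from typing import Callable, Sequence


def shaped_reward(reward: float, s: str, s_next: str,
                  potential: Callable[[str], float], gamma: float) -> float:
    """Return r + F(s, s') with F(s, s') = gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(s_next) - potential(s)


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma^t * r_t over a finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical value estimates used as the potential function (assumption).
value_table = {"clean": 0.0, "scanned": -1.0, "compromised": -5.0}


def phi(state: str) -> float:
    return value_table[state]


gamma = 0.9
states = ["clean", "scanned", "compromised", "scanned"]  # s_0 ... s_T
rewards = [0.0, -0.1, -1.0]                              # r_0 ... r_{T-1}

shaped = [shaped_reward(r, s, s_next, phi, gamma)
          for r, s, s_next in zip(rewards, states, states[1:])]

plain_return = discounted_return(rewards, gamma)
shaped_return = discounted_return(shaped, gamma)

# The shaping terms telescope to gamma^T * phi(s_T) - phi(s_0).
horizon = len(rewards)
telescoped = gamma ** horizon * phi(states[-1]) - phi(states[0])
```

Because the difference between the two returns depends only on the first and last states, shaping of this form changes the learning signal along the way without changing which policies are optimal.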

Abstract

The recent rise in increasingly sophisticated cyber-attacks raises the need for robust and resilient autonomous cyber-defence (ACD) agents. Given the variety of cyber-attack tactics, techniques and procedures (TTPs) employed, learning approaches that can return generalisable policies are desirable. Meanwhile, the assurance of ACD agents remains an open challenge. We address both challenges via an empirical game-theoretic analysis of deep reinforcement learning (DRL) approaches for ACD using the principled double oracle (DO) algorithm. This algorithm relies on adversaries iteratively learning (approximate) best responses against each other's policies, a computationally expensive endeavour for autonomous cyber operations agents. In this work we introduce and evaluate a theoretically sound, potential-based reward shaping approach to expedite this process. In addition, given the increasing number of open-source ACD-DRL approaches, we extend the DO formulation to allow for multiple response oracles (MRO), providing a framework for a holistic evaluation of ACD approaches.
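The double oracle loop described above can be sketched on a small matrix game. This is an illustration under stated assumptions, not the paper's implementation: exact best responses over a known payoff matrix stand in for DRL-trained approximate best responses, fictitious play stands in for the restricted-game Nash solver, and the payoff matrix (rock-paper-scissors plus a dominated fourth action) is hypothetical.

```python
def solve_restricted(payoff, rows, cols, iters=2000):
    """Approximate Nash mixtures of the restricted game via fictitious play."""
    row_counts = [0] * len(rows)
    col_counts = [0] * len(cols)
    row_counts[0] = col_counts[0] = 1
    for _ in range(iters):
        # Each player best-responds to the opponent's empirical play so far.
        row_vals = [sum(payoff[r][c] * n for c, n in zip(cols, col_counts))
                    for r in rows]
        col_vals = [sum(payoff[r][c] * n for r, n in zip(rows, row_counts))
                    for c in cols]
        row_counts[row_vals.index(max(row_vals))] += 1
        col_counts[col_vals.index(min(col_vals))] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return ([n / row_total for n in row_counts],
            [n / col_total for n in col_counts])


def best_response_row(payoff, cols, col_mix):
    """Row player (maximiser) best response over the full action set."""
    vals = [sum(payoff[r][c] * p for c, p in zip(cols, col_mix))
            for r in range(len(payoff))]
    return vals.index(max(vals))


def best_response_col(payoff, rows, row_mix):
    """Column player (minimiser) best response over the full action set."""
    vals = [sum(payoff[r][c] * p for r, p in zip(rows, row_mix))
            for c in range(len(payoff[0]))]
    return vals.index(min(vals))


def double_oracle(payoff, max_iters=20):
    """Grow both populations until neither oracle returns a new strategy."""
    rows, cols = [0], [0]
    for _ in range(max_iters):
        row_mix, col_mix = solve_restricted(payoff, rows, cols)
        br_row = best_response_row(payoff, cols, col_mix)
        br_col = best_response_col(payoff, rows, row_mix)
        if br_row in rows and br_col in cols:
            break
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
    row_mix, col_mix = solve_restricted(payoff, rows, cols)
    return rows, cols, row_mix, col_mix


# Hypothetical zero-sum payoffs for the row player; action 3 is dominated.
payoff = [
    [0, -1, 1, 2],
    [1, 0, -1, 2],
    [-1, 1, 0, 2],
    [-2, -2, -2, 0],
]
rows, cols, row_mix, col_mix = double_oracle(payoff)
```

In this toy game the loop stops once both populations contain the three rock-paper-scissors actions, and the dominated action is never added by either oracle. With learned responses in place of exact ones, each oracle call becomes a full training run, which is the cost the abstract refers to.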

Paper Structure

This paper contains 37 sections, 7 theorems, 37 equations, 17 figures, 10 tables, 2 algorithms.

Key Result

Theorem 1

In a finite two-player zero-sum game, $v^* = \max_{\pi_1} \min_{\pi_2} \mathcal{G}_{i}(\langle \pi_1, \pi_2 \rangle) = \min_{\pi_2} \max_{\pi_1} \mathcal{G}_{i}(\langle \pi_1, \pi_2 \rangle)$.
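The equality in Theorem 1 can be checked numerically on a toy example. The sketch below uses matching pennies (a hypothetical payoff matrix, not one from the paper) and a coarse grid search over mixed strategies, which is only practical for 2x2 games. It also computes the pure-strategy bounds to show that the equality relies on mixtures.

```python
def maximin_2x2(payoff, steps=1000):
    """max over row mixtures of the worst-case (min over columns) payoff."""
    best = float("-inf")
    for k in range(steps + 1):
        p = k / steps  # probability of playing row 0
        worst = min(p * payoff[0][j] + (1 - p) * payoff[1][j]
                    for j in range(2))
        best = max(best, worst)
    return best


def minimax_2x2(payoff, steps=1000):
    """min over column mixtures of the best-case (max over rows) payoff."""
    best = float("inf")
    for k in range(steps + 1):
        q = k / steps  # probability of playing column 0
        worst = max(q * payoff[i][0] + (1 - q) * payoff[i][1]
                    for i in range(2))
        best = min(best, worst)
    return best


# Row player's payoffs in matching pennies (zero-sum).
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]

lower = maximin_2x2(matching_pennies)  # value the row player can guarantee
upper = minimax_2x2(matching_pennies)  # value the column player can enforce

# Restricted to pure strategies the two bounds do not meet.
pure_lower = max(min(row) for row in matching_pennies)
pure_upper = min(max(col) for col in zip(*matching_pennies))
```

Here both mixed-strategy bounds meet at the game value of zero, attained at the uniform mixture, while the pure-strategy bounds stay at -1 and 1.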

Figures (17)

  • Figure 1: A depiction of Blue and Red ABRs from an MRO run on CC2. Rewards are plotted from Blue's perspective. Neither agent is able to find ABRs that significantly improve on the Nash payoff (the black line) between iterations 11 and 20. Red oracles use PTMs for the first 20 iterations, but subsequently switch to random initialisations (PTM=False). Initially this oracle setting finds a policy that significantly improves on the Nash payoff. However, the Blue agent immediately counters by adjusting its mixture.
  • Figure 2: The box plot above compares VF-PBRS runs using VF ensembling (VFE) and individual VFs $k$ (where $\mu_i^k > 0$) against vanilla training runs (10 runs per setting). Statistical $p$-values of less than $0.05$ and $0.01$ are flagged with one and two asterisks respectively. Outliers are plotted as separate black circles.
  • Figure 3: An illustration of CC2's empirical game and Nash mixtures. Cells represent the mean episodic reward for each policy pairing plotted from Red's perspective (100 evaluation episodes). X and Y ticks indicate the ABR iteration in which a response was learnt. We also include original (O) and generalist (G) policies. Darker cells represent match-ups that are favourable for Red.
  • Figure 4: Depicted is the percentage of steps per episode where Red can obtain User and Privileged access on the listed nodes. We compare privileges obtained by Red against the original GPPO parameterisation, $\pi_{Blue}^o$, and the final mixture agent $\mu_{Blue}$. The attacking policy, $\pi_{Red}^2$, is Red's ABR against $\pi_{Blue}^o$.
  • Figure 5: Action percentage comparison for CC4 agents tasked with defending restricted and operational zones A under the joint-policy profiles $\langle \bm{\pi}_{Blue}^o, \bm{\pi}_{Red}^o \rangle$ and $\langle \bm{\pi}_{Blue}^1, \bm{\pi}_{Red}^o \rangle$. Here, $\bm{\pi}_{Blue}^o$ are the original parameterisations from CC4KEEP; $\bm{\pi}_{Red}^o$ is Red's ABR against $\bm{\pi}_{Blue}^o$; and $\bm{\pi}_{Blue}^1$ is Blue's ABR against $\bm{\pi}_{Red}^o$. The latter agents are more active in blocking traffic, thereby mitigating Red's attacks.
  • ...and 12 more figures

Theorems & Definitions (20)

  • Definition 3.1: Nash Equilibrium
  • Theorem 1: Minmax Theorem
  • Definition 3.2: $\epsilon$-Nash Equilibrium
  • Theorem 2
  • proof
  • Definition 1.1: Approximate Best Response
  • Definition 1.2: Resource Bounded Nash Equilibrium
  • Theorem 3
  • proof
  • Theorem 4
  • ...and 10 more