An Empirical Game-Theoretic Analysis of Autonomous Cyber-Defence Agents
Gregory Palmer, Luke Swaby, Daniel J. B. Harrold, Matthew Stewart, Alex Hiles, Chris Willis, Ian Miles, Sara Farmer
TL;DR
This work addresses the challenge of learning robust autonomous cyber-defence (ACD) policies in adversarial, high-dimensional settings by framing ACD as a partially observable Markov game and applying an empirical game-theoretic analysis built on the principled double oracle (DO) algorithm. It extends DO with multiple response oracles (MRO) and introduces value-function-based potential-based reward shaping (VF-PBRS), along with sampling from pre-trained models (PTMs), to accelerate convergence and improve policy robustness. Through empirical studies on the CybORG CAGE CC2 and CC4 environments, the authors demonstrate that VF-PBRS and PTMs can yield stronger, more generalisable Blue policies and that MRO preserves convergence guarantees while enabling richer policy mixtures. The findings underscore the importance of adversarially evaluating ACD approaches against diverse, worst-case attackers and highlight practical considerations for deployment, including computation, ensemble design, and ethical implications of adversarial learning in high-fidelity cyber environments.
Abstract
The recent rise in increasingly sophisticated cyber-attacks raises the need for robust and resilient autonomous cyber-defence (ACD) agents. Given the variety of cyber-attack tactics, techniques and procedures (TTPs) employed, learning approaches that can return generalisable policies are desirable. Meanwhile, the assurance of ACD agents remains an open challenge. We address both challenges via an empirical game-theoretic analysis of deep reinforcement learning (DRL) approaches for ACD using the principled double oracle (DO) algorithm. This algorithm relies on adversaries iteratively learning (approximate) best responses against each other's policies, a computationally expensive endeavour for autonomous cyber operations agents. In this work we introduce and evaluate a theoretically sound, potential-based reward shaping approach to expedite this process. In addition, given the increasing number of open-source ACD-DRL approaches, we extend the DO formulation to allow for multiple response oracles (MRO), providing a framework for a holistic evaluation of ACD approaches.
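
To make the double oracle loop described above concrete, the following is a minimal sketch on a toy zero-sum matrix game. It is an illustration under stated assumptions, not the authors' implementation: the payoff matrix, the function names (solve_restricted, double_oracle, exploitability) and the use of fictitious play to solve each restricted game are all choices made for this example. The oracles here are exact argmax/argmin lookups over a known matrix, standing in for the DRL-trained approximate best responses discussed in the abstract.

```python
# Toy zero-sum game: PAYOFF[r][c] is the row player's (maximiser's) payoff.
# Actions 0-2 form rock-paper-scissors; action 3 is a deliberately weak extra action.
# This matrix is hypothetical and unrelated to the CybORG environments.
PAYOFF = [
    [0, -1, 1, 1],
    [1, 0, -1, 1],
    [-1, 1, 0, 1],
    [-1, -1, -1, 0],
]


def solve_restricted(payoff, rows, cols, iters=20000):
    """Approximate an equilibrium of the sub-game on (rows, cols) by fictitious play.

    Fictitious play is an assumption of this sketch; any zero-sum solver would do.
    """
    row_counts = [1] + [0] * (len(rows) - 1)
    col_counts = [1] + [0] * (len(cols) - 1)
    for _ in range(iters):
        # Each player best-responds to the opponent's empirical action counts.
        row_vals = [sum(payoff[r][c] * n for c, n in zip(cols, col_counts)) for r in rows]
        col_vals = [sum(payoff[r][c] * n for r, n in zip(rows, row_counts)) for c in cols]
        row_counts[row_vals.index(max(row_vals))] += 1
        col_counts[col_vals.index(min(col_vals))] += 1
    total = iters + 1
    return [n / total for n in row_counts], [n / total for n in col_counts]


def double_oracle(payoff, max_rounds=50):
    """Grow each player's policy set with best responses until neither oracle adds one."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    rows, cols = [0], [0]  # start each population with one arbitrary policy
    for _ in range(max_rounds):
        p, q = solve_restricted(payoff, rows, cols)
        # Oracle step: best pure response over the FULL action set to the opponent's mixture.
        row_vals = [sum(payoff[r][c] * w for c, w in zip(cols, q)) for r in range(n_rows)]
        col_vals = [sum(payoff[r][c] * w for r, w in zip(rows, p)) for c in range(n_cols)]
        best_row = row_vals.index(max(row_vals))
        best_col = col_vals.index(min(col_vals))
        if best_row in rows and best_col in cols:
            break  # no new best response on either side: stop
        if best_row not in rows:
            rows.append(best_row)
        if best_col not in cols:
            cols.append(best_col)
    return rows, cols, p, q


def exploitability(payoff, rows, cols, p, q):
    """Gap between the two players' best-response values; zero at an exact equilibrium."""
    best_row = max(sum(payoff[r][c] * w for c, w in zip(cols, q)) for r in range(len(payoff)))
    best_col = min(sum(payoff[r][c] * w for r, w in zip(rows, p)) for c in range(len(payoff[0])))
    return best_row - best_col


rows, cols, p, q = double_oracle(PAYOFF)
print("row support:", rows, "mixture:", [round(w, 3) for w in p])
print("col support:", cols, "mixture:", [round(w, 3) for w in q])
print("exploitability:", round(exploitability(PAYOFF, rows, cols, p, q), 4))
```

In this toy game the loop stops without ever adding the weak fourth action to either population, which is the practical appeal of the approach: only policies that are a best response to some intermediate mixture need to be computed. In the setting of the abstract, each oracle call corresponds to training a DRL agent and each payoff entry would have to be estimated empirically, which is why reducing the cost of the best-response step matters.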
