Towards a Generalisable Cyber Defence Agent for Real-World Computer Networks
Tim Dudman, Martyn Bull
TL;DR
TERLA (Topological Extensions for Reinforcement Learning Agents) uses heterogeneous graph transformers to produce fixed-size latent embeddings of observed networks, paired with a reduced, interpretable action space. This enables PPO-based agents to generalise across networks with different topology and size without retraining, while maintaining defensive performance and improving action efficiency. Evaluation in the CAGE Challenge 4 (CC4) environment demonstrates protection comparable to vanilla PPO and highlights the benefits of deploying a single TERLA agent across multiple network segments. The work outlines practical steps toward real-world applicability, including action-waiting and reward shaping to address CC4 dynamics, and recommendations for future maturation across larger action spaces and human–AI teaming.
Abstract
Recent advances in deep reinforcement learning for autonomous cyber defence have resulted in agents that can successfully defend simulated computer networks against cyber-attacks. However, many of these agents would need retraining to defend networks with differing topology or size, making them poorly suited to real-world networks where topology and size can vary over time. In this research, we introduce a novel set of Topological Extensions for Reinforcement Learning Agents (TERLA) that provide generalisability for the defence of networks with differing topology and size, without the need for retraining. Our approach uses heterogeneous graph neural network layers to produce a fixed-size latent embedding representing the observed network state. This representation learning stage is coupled with a reduced, fixed-size, semantically meaningful and interpretable action space. We apply TERLA to a standard deep reinforcement learning Proximal Policy Optimisation (PPO) agent model and, to reduce the sim-to-real gap, conduct our research using Cyber Autonomy Gym for Experimentation (CAGE) Challenge 4. This Cyber Operations Research Gym environment has many of the features of a real-world network, such as realistic Intrusion Detection System (IDS) events and multiple agents defending network segments of differing topology and size. TERLA agents retain the defensive performance of vanilla PPO agents whilst showing improved action efficiency. We demonstrate generalisability in two ways: all TERLA agents share the same network-agnostic neural network architecture, and a single TERLA agent deployed multiple times to defend network segments with differing topology and size shows improved defensive performance and efficiency.
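
To illustrate why a fixed-size latent embedding makes the agent architecture independent of network size, the following is a minimal sketch that mean-pools node features per node type and concatenates the results. It is not the TERLA implementation: simple per-type mean pooling stands in for the learned heterogeneous graph layers, and the node types, feature width and the `embed_network` function are hypothetical names chosen for illustration.

```python
from typing import Dict, List

# Hypothetical node types and feature width; not taken from the paper.
NODE_TYPES = ("host", "server", "router")
FEATURE_DIM = 3


def embed_network(nodes: Dict[str, List[List[float]]]) -> List[float]:
    """Mean-pool node features per type and concatenate the results.

    The output length is len(NODE_TYPES) * FEATURE_DIM regardless of how
    many nodes the observed network contains.
    """
    embedding: List[float] = []
    for node_type in NODE_TYPES:
        features = nodes.get(node_type, [])
        if not features:
            # An absent node type contributes zeros, keeping the size fixed.
            embedding.extend([0.0] * FEATURE_DIM)
            continue
        for dim in range(FEATURE_DIM):
            embedding.append(sum(f[dim] for f in features) / len(features))
    return embedding


# Two illustrative network segments of different size and composition.
small_segment = {
    "host": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "router": [[0.0, 0.0, 1.0]],
}
large_segment = {
    "host": [[1.0, 0.0, 0.0]] * 10,
    "server": [[0.0, 1.0, 1.0]] * 4,
    "router": [[0.0, 0.0, 1.0]] * 2,
}

small_embedding = embed_network(small_segment)
large_embedding = embed_network(large_segment)
print(len(small_embedding), len(large_embedding))
```

Both segments map to a vector of the same length, so a downstream policy network could consume either without any change to its input layer. A learned graph layer would additionally pass messages along edges before pooling, which this sketch omits.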
