Entity-based Reinforcement Learning for Autonomous Cyber Defence

Isaac Symes Thompson, Alberto Caron, Chris Hicks, Vasilios Mavroudis

TL;DR

This work tackles the problem of generalising autonomous cyber defence policies across diverse network topologies. It introduces entity-based reinforcement learning, using the Entity Gym framework and the RogueNet Transformer policy, to enable compositional generalisation over variable numbers of nodes. The approach is evaluated on the Yawning Titan simulator on networks of size $n \in \{10, 20, 40\}$, with zero-shot tests on sizes not seen in training. Compared with fixed-input MLP baselines, the entity-based approach learns better on randomly changing topologies and shows encouraging zero-shot transfer, supporting broader applicability to real-world, dynamic networks. The authors provide an open-source implementation and discuss how realism and robustness could be improved further through global information, richer topology variation, and graph-aware architectures.
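To make the entity decomposition concrete, the sketch below represents each node as one entity with a fixed-width feature vector, so a network of $n$ nodes becomes an $n \times d$ array rather than a flat vector whose length changes with $n$. The three node features and the `EntityObservation`, `make_observation` and `flat_mlp_input` names are hypothetical illustrations; this is neither the Entity Gym API nor Yawning Titan's observation schema.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical per-node features (an assumption, not Yawning Titan's schema):
# [vulnerability score, is_compromised flag, is_entry_node flag]
NODE_FEATURES = 3


@dataclass
class EntityObservation:
    """A variable-size set of node entities with a fixed per-entity width."""

    node_features: np.ndarray  # shape (n_nodes, NODE_FEATURES)


def make_observation(n_nodes: int, rng: np.random.Generator) -> EntityObservation:
    """Build a toy observation with random features for an n-node network."""
    vulnerability = rng.uniform(0.0, 1.0, size=n_nodes)
    compromised = rng.integers(0, 2, size=n_nodes).astype(float)
    entry = np.zeros(n_nodes)
    entry[rng.choice(n_nodes, size=2, replace=False)] = 1.0  # two entry nodes
    return EntityObservation(np.stack([vulnerability, compromised, entry], axis=1))


def flat_mlp_input(obs: EntityObservation) -> np.ndarray:
    """What a fixed-input MLP would consume: a vector whose length grows with n."""
    return obs.node_features.reshape(-1)


rng = np.random.default_rng(0)
for n in (10, 20, 40):
    obs = make_observation(n, rng)
    # Entity view keeps width 3; the flattened MLP view changes length (30, 60, 120).
    print(n, obs.node_features.shape, flat_mlp_input(obs).shape)
```

An MLP whose first layer expects 30 inputs cannot consume the 60- or 120-element vectors, whereas a policy that processes each row with shared weights can.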

Abstract

A significant challenge for autonomous cyber defence is ensuring that a defensive agent can generalise across diverse network topologies and configurations. This capability is necessary for agents to remain effective when deployed in dynamically changing environments, such as an enterprise network where devices may frequently join and leave. Standard approaches to deep reinforcement learning, where policies are parameterised using a fixed-input multi-layer perceptron (MLP), expect fixed-size observation and action spaces. In autonomous cyber defence, this makes it hard to develop agents that generalise to environments whose network topologies differ from those seen in training, because the number of nodes determines the natural size of the observation and action spaces. To overcome this limitation, we reframe the problem of autonomous network defence using entity-based reinforcement learning, where the observation and action spaces of an agent are decomposed into a collection of discrete entities. This framework enables the use of policy parameterisations specialised for compositional generalisation. We train a Transformer-based policy on the Yawning Titan cyber-security simulation environment and test its generalisation capabilities across various network topologies. We demonstrate that this approach significantly outperforms an MLP-based policy when training across fixed-size networks of varying topologies, and matches its performance when training on a single network. We also demonstrate the potential for zero-shot generalisation to networks of a different size from those seen in training. These findings highlight the potential for entity-based reinforcement learning to advance the field of autonomous cyber defence by providing more generalisable policies capable of handling variations in real-world network environments.
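The key property the abstract relies on is that an entity-based policy applies the same weights to every entity, so one set of parameters can score the nodes of a network of any size. The sketch below illustrates this with a single self-attention layer in NumPy that outputs one row of action logits per node. `TinyEntityPolicy`, its dimensions and the greedy action selection are assumptions made for illustration; this is not RogueNet itself.

```python
import numpy as np


class TinyEntityPolicy:
    """Hypothetical one-layer self-attention policy over node entities (not RogueNet).

    Every weight matrix acts on individual entities, so the parameter count does
    not depend on the number of nodes in the network.
    """

    def __init__(self, in_dim: int, d_model: int, n_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.n_actions = n_actions
        self.W_embed = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, d_model))
        self.W_q, self.W_k, self.W_v = (
            rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, d_model)) for _ in range(3)
        )
        self.W_out = rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, n_actions))

    def logits(self, entities: np.ndarray) -> np.ndarray:
        """Map an (n_nodes, in_dim) entity array to (n_nodes, n_actions) logits."""
        h = entities @ self.W_embed
        q, k, v = h @ self.W_q, h @ self.W_k, h @ self.W_v
        scores = q @ k.T / np.sqrt(q.shape[1])        # (n_nodes, n_nodes)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
        h = h + attn @ v                              # residual connection
        return h @ self.W_out

    def act(self, entities: np.ndarray) -> tuple[int, int]:
        """Greedily pick the (node index, action index) pair with the highest logit."""
        flat_index = int(np.argmax(self.logits(entities)))
        node, action = divmod(flat_index, self.n_actions)
        return node, action


policy = TinyEntityPolicy(in_dim=3, d_model=16, n_actions=4)
rng = np.random.default_rng(0)
for n in (10, 20, 40):  # the same parameters serve every network size
    x = rng.normal(size=(n, 3))
    print(n, policy.logits(x).shape, policy.act(x))
```

Because the per-entity layers and the attention step commute with reordering the rows, permuting the nodes permutes the logits in the same way, which is one reason such architectures suit sets of network nodes.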

Paper Structure

This paper contains 31 sections, 2 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Plots showing examples of the structure of random networks used in the Yawning Titan environment, with entry nodes marked in red.
  • Figure 2: Train-time episodic rewards evaluated on the 10-, 20- and 40-node networks, respectively. The four agents compared in each evaluation are the baseline PPO agent on the static (sb3_static_[nodes]) and random (sb3_random_[nodes]) network environments, and the entity neural network agent on the same static (Entity_static_[nodes]) and random (Entity_random_[nodes]) environments. Rewards are the mean over three random seeds, and shaded error bands span the minimum and maximum of the three runs. These bands are barely visible because there was little variation between the three runs.
  • Figure 3: Box-plots of episodic rewards over 1,000 test episodes for three entity-based agents trained on different network sizes. Subfigure (a) shows the three agents, trained on 10-, 20- and 40-node networks respectively, evaluated at test time on random 10-node networks (eval_rand_k_on_10 with $k \in \{10, 20, 40\}$). Subfigure (b) shows the same agents evaluated on 20-node networks (eval_rand_k_on_20 with $k \in \{10, 20, 40\}$). Subfigure (c) shows the same agents evaluated on 40-node networks (eval_rand_k_on_40 with $k \in \{10, 20, 40\}$).
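The aggregation described in the Figure 2 and Figure 3 captions can be sketched in a few lines of NumPy: a mean curve with a min/max band across seeds, and the quartile summary a box-plot is built from. The function names and the toy numbers in the usage lines are illustrative assumptions, not values from the paper.

```python
import numpy as np


def seed_band(curves):
    """Mean reward curve plus a min/max band across random seeds.

    curves: array-like of shape (n_seeds, n_eval_points), one curve per seed.
    """
    curves = np.asarray(curves, dtype=float)
    return curves.mean(axis=0), curves.min(axis=0), curves.max(axis=0)


def box_stats(episode_rewards):
    """Quartiles and extremes of per-episode test rewards.

    Note: matplotlib's boxplot extends whiskers only to the furthest point within
    1.5 * IQR by default, so its whiskers can differ from the min/max here.
    """
    r = np.asarray(episode_rewards, dtype=float)
    q1, median, q3 = np.percentile(r, [25, 50, 75])
    return {"min": r.min(), "q1": q1, "median": median, "q3": q3, "max": r.max()}


# Toy numbers for illustration only (three seeds, three evaluation points).
mean, low, high = seed_band([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(mean, low, high)  # [2. 3. 4.] [1. 2. 3.] [3. 4. 5.]
print(box_stats([1, 2, 3, 4, 5]))
```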