Less is more? Rewards in RL for Cyber Defence
Elizabeth Bates, Chris Hicks, Vasilios Mavroudis
TL;DR
This paper questions the effectiveness of dense reward signals in DRL-driven autonomous cyber defence and introduces a ground-truth evaluation framework to compare sparse versus dense rewards. Using the Yawning Titan cyber gym, it compares two sparse reward schemes against a dense baseline across network sizes from 2 to 50 nodes and with both reactive and proactive defence actions, trained via PPO. The key contributions include a per-action ground-truth score, a reliability metric based on dispersion variability, and empirical evidence that a sparse positive reward improves both final defence performance and training stability across diverse settings. The findings offer practical guidance for reward design in cyber defence DRL and demonstrate that sparse rewards can generalize across different action spaces and MDP configurations.
Abstract
The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all, cyber gyms provide dense "scaffolded" reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps, they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. This is especially a problem in complex cyber environments, where policy weaknesses may not be noticed until exploited by an adversary. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal, we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.
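To make the distinction between the two reward styles concrete, the following is a minimal illustrative sketch in Python. It is not the reward code used in the paper or in Yawning Titan: the `NetworkState` type, the function names, and the penalty weights are all hypothetical, chosen only to contrast a sparse positive signal for an uncompromised network with a dense signal that sums several penalties.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NetworkState:
    """Hypothetical state: one flag per node, True if that node is compromised."""
    compromised: Sequence[bool]


def sparse_positive_reward(state: NetworkState) -> float:
    """Sparse scheme (assumed form): reward only a fully uncompromised network."""
    return 1.0 if not any(state.compromised) else 0.0


def dense_reward(state: NetworkState, action_cost: float,
                 per_node_penalty: float = 0.1) -> float:
    """Dense scheme (assumed form): penalise every compromised node and
    subtract the cost of the action taken. The weights are illustrative."""
    return -per_node_penalty * sum(state.compromised) - action_cost


clean = NetworkState(compromised=[False, False, False])
breached = NetworkState(compromised=[True, False, True])

print(sparse_positive_reward(clean), sparse_positive_reward(breached))
print(dense_reward(clean, action_cost=0.5), dense_reward(breached, action_cost=0.5))
```

In this sketch the dense function shapes behaviour through hand-chosen weights, such as the relative cost of an action against a compromised node, whereas the sparse function encodes only the goal itself and leaves the agent to discover how to reach it.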
