Finding Effective Security Strategies through Reinforcement Learning and Self-Play
Kim Hammar, Rolf Stadler
TL;DR
This paper treats intrusion prevention as a zero-sum Markov game between an attacker and a defender and uses self-play reinforcement learning to evolve both strategies without human intervention. It introduces an autoregressive policy that decomposes each action into a node selection and an attack or defense type, neural-network function approximation for high-dimensional state spaces, and an opponent pool to stabilize learning in the non-stationary multi-agent setting. Empirical results show that the proposed PPO-AR method outperforms baseline REINFORCE and PPO approaches and yields strategies that resemble human attack and defense planning, although policy convergence remains difficult when both agents learn simultaneously. The work demonstrates that self-play can uncover effective security strategies in a simplified network-infrastructure model and points to scaling to more realistic topologies and richer action spaces as directions for future work on automated cyber defense.
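The autoregressive decomposition mentioned above can be illustrated with a minimal sketch. It assumes the node is chosen first and the attack or defense type is then chosen conditioned on that node, so that the joint action probability factorizes as a product of two smaller distributions. The function names and the hand-written logits are hypothetical; in the paper's setting these scores would come from a neural network, and the exact ordering of the factorization is an assumption here.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_autoregressive(node_logits, type_logits_per_node, rng):
    """Sample (node, action_type) in two steps and return the joint log-probability.

    node_logits: one score per node (placeholder for a network output).
    type_logits_per_node: for each node, one score per attack/defense type
        (placeholder for a second network head conditioned on the node).
    """
    # Step 1: choose which node to act on.
    node_probs = softmax(node_logits)
    node = rng.choices(range(len(node_probs)), weights=node_probs)[0]

    # Step 2: choose the action type, conditioned on the selected node.
    type_probs = softmax(type_logits_per_node[node])
    action_type = rng.choices(range(len(type_probs)), weights=type_probs)[0]

    # The joint log-probability is the sum of the two conditional terms.
    log_prob = math.log(node_probs[node]) + math.log(type_probs[action_type])
    return node, action_type, log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    node_logits = [0.0, 1.0, -1.0]                   # 3 nodes (illustrative)
    type_logits = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # 2 types per node
    print(sample_autoregressive(node_logits, type_logits, rng))
```

With N nodes and T types, this factorization needs two output heads of sizes N and T rather than a single head of size N times T, which is the usual motivation for decomposing a large action space this way.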
Abstract
We present a method to automatically find security strategies for the use case of intrusion prevention. Following this method, we model the interaction between an attacker and a defender as a Markov game and let attack and defense strategies evolve through reinforcement learning and self-play without human intervention. Using a simple infrastructure configuration, we demonstrate that effective security strategies can emerge from self-play. This shows that self-play, which has been applied in other domains with great success, can be effective in the context of network security. Inspection of the converged policies shows that they reflect common-sense knowledge and resemble strategies used by humans. Moreover, we address known challenges of reinforcement learning in this domain and present an approach that uses function approximation, an opponent pool, and an autoregressive policy representation. Through evaluations, we show that our method is superior to two baseline methods but that policy convergence in self-play remains a challenge.
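As a rough illustration of the opponent-pool idea, the sketch below keeps a bounded set of past policy snapshots and draws a training opponent from it, so that a learner does not train only against the opponent's most recent policy. The class name, the fixed pool size, the oldest-first eviction, and the uniform sampling rule are all assumptions made for this example, not details taken from the paper.

```python
import copy
import random


class OpponentPool:
    """Bounded collection of past opponent policy snapshots (hypothetical design)."""

    def __init__(self, max_size, rng=None):
        self.max_size = max_size
        self.rng = rng or random.Random()
        self.snapshots = []

    def add(self, policy_params):
        """Store a frozen copy of the current policy parameters."""
        # Deep copy so later training updates do not alter the stored snapshot.
        self.snapshots.append(copy.deepcopy(policy_params))
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)  # evict the oldest snapshot (assumed rule)

    def sample(self):
        """Pick an opponent uniformly at random from the pool (assumed rule)."""
        return self.rng.choice(self.snapshots)


if __name__ == "__main__":
    pool = OpponentPool(max_size=2, rng=random.Random(0))
    attacker_params = {"w": [1]}      # placeholder for network weights
    for step in range(3):
        attacker_params["w"][0] = step
        pool.add(attacker_params)     # snapshot after each training phase
    print(pool.snapshots)             # the two most recent snapshots
    print(pool.sample())              # opponent used for the next episode
```

In a self-play loop, the defender would call `sample()` at the start of each episode to obtain an attacker policy, and the attacker would do the same with a separate pool of defender snapshots.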
