Finding Effective Security Strategies through Reinforcement Learning and Self-Play

Kim Hammar, Rolf Stadler

TL;DR

This paper models intrusion prevention as a zero-sum Markov game between an attacker and a defender and uses reinforcement learning with self-play to evolve both strategies without human intervention. It introduces an autoregressive policy that decomposes each action into a node choice and an attack/defense type, neural function approximation for high-dimensional state spaces, and an opponent pool to stabilize learning in the non-stationary multi-agent setting. Empirical results show that the proposed PPO-AR method outperforms the REINFORCE and PPO baselines and yields attack and defense strategies that resemble those of humans, although policy convergence remains a challenge when both agents learn simultaneously. The work demonstrates that self-play can uncover effective security strategies in a simplified infrastructure model and points to more realistic topologies and richer action spaces as directions for scaling, with implications for automated cyber defense.
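
The autoregressive decomposition mentioned above can be illustrated with a minimal sketch. It assumes the policy factorizes as pi(a|s) = pi(node|s) * pi(type|s, node); the function names and the fixed logits are hypothetical stand-ins for what would be neural network outputs in the paper's method, not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_autoregressive(node_logits, type_logits_per_node, rng):
    """Sample (node, action_type) in two steps and return the joint log-probability."""
    # Step 1: choose which node to act on.
    node_probs = softmax(node_logits)
    node = rng.choices(range(len(node_probs)), weights=node_probs)[0]
    # Step 2: choose the attack/defense type, conditioned on the chosen node.
    type_probs = softmax(type_logits_per_node[node])
    action_type = rng.choices(range(len(type_probs)), weights=type_probs)[0]
    # The joint log-probability is the sum of the two conditional log-probabilities.
    log_prob = math.log(node_probs[node]) + math.log(type_probs[action_type])
    return node, action_type, log_prob


# Hypothetical logits for 3 nodes and 4 action types per node (illustrative values only).
rng = random.Random(0)
node_logits = [0.5, 1.5, -0.2]
type_logits_per_node = [
    [0.0, 0.1, 0.2, 0.3],
    [1.0, 0.0, -1.0, 0.5],
    [0.2, 0.2, 0.2, 0.2],
]
node, action_type, log_prob = sample_autoregressive(node_logits, type_logits_per_node, rng)
```

With this factorization, each sampling step outputs a distribution over either the nodes or the action types, rather than one distribution over every (node, type) pair.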

Abstract

We present a method to automatically find security strategies for the use case of intrusion prevention. Following this method, we model the interaction between an attacker and a defender as a Markov game and let attack and defense strategies evolve through reinforcement learning and self-play without human intervention. Using a simple infrastructure configuration, we demonstrate that effective security strategies can emerge from self-play. This shows that self-play, which has been applied in other domains with great success, can be effective in the context of network security. Inspection of the converged policies shows that the emerged policies reflect common-sense knowledge and are similar to strategies of humans. Moreover, we address known challenges of reinforcement learning in this domain and present an approach that uses function approximation, an opponent pool, and an autoregressive policy representation. Through evaluations we show that our method is superior to two baseline methods but that policy convergence in self-play remains a challenge.
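
One way to picture the opponent pool is as a bounded collection of frozen policy snapshots from which the learning agent's opponent is drawn. The sketch below is an illustration under assumptions that are not taken from the paper: the class name `OpponentPool`, uniform sampling, oldest-first eviction, and a plain dictionary standing in for policy parameters are all hypothetical.

```python
import copy
import random


class OpponentPool:
    """Bounded pool of frozen policy snapshots to sample opponents from."""

    def __init__(self, max_size, rng=None):
        self.max_size = max_size
        self.snapshots = []
        self.rng = rng or random.Random()

    def add(self, policy_params):
        # Deep-copy so that later training updates do not alter the stored snapshot.
        self.snapshots.append(copy.deepcopy(policy_params))
        if len(self.snapshots) > self.max_size:
            # Assumption: evict the oldest snapshot when the pool is full.
            self.snapshots.pop(0)

    def sample(self):
        # Assumption: opponents are drawn uniformly at random from the pool.
        return self.rng.choice(self.snapshots)


pool = OpponentPool(max_size=3, rng=random.Random(0))
params = {"w": [0.0]}  # stand-in for real policy parameters
for step in range(5):
    params["w"][0] += 1.0  # stand-in for one training update
    pool.add(params)
opponent = pool.sample()
```

Training against a sampled past snapshot rather than only the latest opponent is the general idea behind reducing non-stationarity; the sampling and eviction rules here are design choices that can be varied.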

Paper Structure

This paper contains 33 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Modeling intrusion prevention as a Markov game.
  • Figure 2: Three scenarios of the intrusion prevention game. Scenarios $1$ and $2$ model infrastructures with strong defenses (first four defense attributes) but weak detection capabilities (fifth defense attribute). In scenario $1$, each node contains one vulnerability (a defense attribute with a value $\leq 1$), whereas in scenario $2$ only one of the intermediary nodes has a vulnerability (the left one). Scenario $3$ models an infrastructure with both weak defenses and weak detection capabilities.
  • Figure 3: Attacker win ratio against the number of training iterations; the top row shows the results from training the attacker against DefendMinimal; the bottom row shows the results from training the defender against AttackMaximal; the three columns represent the three scenarios; the curve labeled PPO-AR shows the mean values of our proposed method; the results are averages over five training runs with different random seeds; the shaded regions show the standard deviation.
  • Figure 4: An illustration of a learned attack strategy, evolving from left to right. The attacker first scans a neighboring node for vulnerabilities (low defense attributes) (state $t_1$). The attacker then exploits the found vulnerability (state $t_2$), compromises the node, and scans the target node $N_{data}$ (state $t_3$). Finally, the attacker completes the intrusion by attacking $N_{data}$ (state $t_4$).
  • Figure 5: Attacker win ratio against the number of training iterations; the three sub-graphs show the results from training the attacker and the defender in self-play for the three scenarios; the curve labeled PPO-AR shows the mean values of our proposed method; the results are averages over five training runs with different random seeds; the shaded regions show the standard deviation.
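
The mean curves and shaded standard-deviation regions described for Figures 3 and 5 can be computed per training iteration across runs. The following is a minimal sketch of that aggregation; the function name `summarize_runs` is hypothetical and the win ratios are made-up placeholder values, not results from the paper.

```python
import statistics


def summarize_runs(runs):
    """Return per-iteration means and standard deviations across equal-length runs."""
    means, stds = [], []
    for values in zip(*runs):  # values holds one iteration's result from every run
        means.append(statistics.mean(values))
        # Population standard deviation; the sample version (stdev) is another option.
        stds.append(statistics.pstdev(values))
    return means, stds


# Placeholder attacker win ratios: 5 runs (random seeds) x 3 evaluation points.
runs = [
    [0.0, 0.2, 0.6],
    [0.0, 0.4, 0.8],
    [0.0, 0.3, 0.7],
    [0.0, 0.2, 0.7],
    [0.0, 0.4, 0.7],
]
means, stds = summarize_runs(runs)
# A shaded band would span means[i] - stds[i] to means[i] + stds[i] at iteration i.
```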