Learning Intrusion Prevention Policies through Optimal Stopping

Kim Hammar, Rolf Stadler

TL;DR

This work formulates intrusion prevention as a discrete-time, partially observed optimal stopping problem and shows that an optimal defender policy can be expressed as a threshold on the posterior intrusion probability. The authors model the use case as a POMDP, derive the threshold property, and learn near-optimal policies with model-free reinforcement learning (PPO) in a simulation whose dynamics are estimated from measurements in an emulation of the target infrastructure. These measurements give empirical distributions of IDS alerts and login attempts, which are used to instantiate POMDP episodes. The results show that the learned policies converge, can be expressed as thresholds, and perform close to the optimal policy; they also indicate the relative importance of different measurements for the stopping decision.

Abstract

We study automated intrusion prevention using reinforcement learning. In a novel approach, we formulate the problem of intrusion prevention as an optimal stopping problem. This formulation allows us insight into the structure of the optimal policies, which turn out to be threshold based. Since the computation of the optimal defender policy using dynamic programming is not feasible for practical cases, we approximate the optimal policy through reinforcement learning in a simulation environment. To define the dynamics of the simulation, we emulate the target infrastructure and collect measurements. Our evaluations show that the learned policies are close to optimal and that they indeed can be expressed using thresholds.

Paper Structure

This paper contains 36 sections, 4 theorems, 26 equations, 8 figures, 5 tables.

Key Result

Theorem 1

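There exists an optimal policy $\pi^{*}$ which is a threshold policy of the form: $\pi^{*}(b) = \text{stop}$ if $b \geq \alpha^{*}$ and $\pi^{*}(b) = \text{continue}$ otherwise, where $b$ is the posterior probability of an ongoing intrusion and $\alpha^{*}$ is a threshold.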
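
A threshold policy of the kind Theorem 1 describes can be sketched in a few lines. The sketch below is illustrative and is not the paper's model: it assumes a two-state belief (no intrusion or ongoing intrusion), an intrusion that starts at each step with a fixed probability `p_start`, and observation likelihoods supplied by the caller. The names `update_belief`, `threshold_policy`, and `run_episode`, and all numeric values, are hypothetical.

```python
def update_belief(b, lik_no_intrusion, lik_intrusion, p_start):
    """One Bayes step for the posterior intrusion probability b.

    Assumption (not from the paper): an intrusion, once started, persists,
    and starts at each step with probability p_start.
    """
    # Prediction: the intrusion is already ongoing, or it starts now.
    prior = b + (1.0 - b) * p_start
    # Correction: weight each hypothesis by the observation likelihood.
    numerator = prior * lik_intrusion
    denominator = numerator + (1.0 - prior) * lik_no_intrusion
    return numerator / denominator if denominator > 0 else prior


def threshold_policy(b, alpha):
    """Stop when the posterior reaches the threshold, otherwise continue."""
    return "stop" if b >= alpha else "continue"


def run_episode(likelihood_pairs, alpha, p_start, b0=0.0):
    """Apply the threshold policy along a sequence of observations.

    likelihood_pairs holds (lik_no_intrusion, lik_intrusion) per time step.
    Returns (stopping_time, belief), with stopping_time None if never stopped.
    """
    b = b0
    for t, (lik0, lik1) in enumerate(likelihood_pairs, start=1):
        b = update_belief(b, lik0, lik1, p_start)
        if threshold_policy(b, alpha) == "stop":
            return t, b
    return None, b


if __name__ == "__main__":
    # Hypothetical likelihoods: the first observation looks benign,
    # the second looks like an intrusion.
    observations = [(0.9, 0.1), (0.1, 0.9)]
    print(run_episode(observations, alpha=0.5, p_start=0.2))
```

The point of the structure is that the defender only needs to track a single number, the posterior, and compare it with one threshold; learning the policy then amounts to finding a good value for that threshold.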

Figures (8)

  • Figure 1: The IT infrastructure and the actors in the use case.
  • Figure 2: Optimal stopping formulation of intrusion prevention; the horizontal axis represents time; $T$ is the time horizon; the episode length is $T-1$; the dashed line shows the intrusion start time; the optimal policy is to stop at the time of intrusion.
  • Figure 3: Left: the reward function for the stop and continue actions; the intrusion starts at $t=29$; right: the cumulative distribution function (CDF) of the intrusion start time.
  • Figure 4: Empirical distributions of IDS alerts (top row) and login attempts on two servers (bottom row); the graphs include several distributions that are superimposed.
  • Figure 5: Learning curves; the graphs show from left to right: episodic reward, length of an episode, empirical detection probability, empirical early stopping probability, and the number of steps between the start of an intrusion and the stop action; the curves show the averages and the standard deviations of three training runs with different random seeds.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • proof: Proof of Theorem 1