
Learning Optimal Defender Strategies for CAGE-2 using a POMDP Model

Duc Huy Le, Rolf Stadler

TL;DR

This work addresses learning optimal defender strategies for the CAGE-2 cyberdefense benchmark by formalizing the scenario as a Partially Observable Markov Decision Process (POMDP) and introducing BF-PPO, which combines PPO with a particle-filter belief representation to handle the enormous state space. The approach yields an offline learning method that computes near-optimal defender policies and scales to CAGE-2’s partial observability and complex infrastructure. Key contributions include a formal POMDP model of CAGE-2, the BF-PPO algorithm for belief-based policy optimization, and an empirical evaluation in the CybORG environment showing superior performance and faster convergence compared to CARDIFF. This framework enhances the reliability and efficiency of learning defender strategies in realistic, partially observable network environments and opens avenues for extending to attacker optimization and causal modelling.

Abstract

CAGE-2 is an accepted benchmark for learning and evaluating defender strategies against cyberattacks. It reflects a scenario where a defender agent protects an IT infrastructure against various attacks. Many defender methods for CAGE-2 have been proposed in the literature. In this paper, we construct a formal model for CAGE-2 using the Partially Observable Markov Decision Process (POMDP) framework. Based on this model, we define an optimal defender strategy for CAGE-2 and introduce a method to efficiently learn this strategy. Our method, called BF-PPO, is based on PPO and uses a particle filter to mitigate the computational complexity due to the large state space of the CAGE-2 model. We evaluate our method in the CAGE-2 CybORG environment and compare its performance with that of CARDIFF, the highest-ranked method on the CAGE-2 leaderboard. We find that our method outperforms CARDIFF with respect to both the learned defender strategy and the required training time.
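
The abstract states only that BF-PPO pairs PPO with a particle filter; it does not give the filter's details. As a rough illustration of the general idea, the following is a minimal sketch of a generic bootstrap particle filter that tracks a belief over a hidden state. The one-host model, the probabilities, the action names, and all function names are hypothetical and are not taken from the paper or from CybORG.

```python
import random

# Hypothetical toy model (NOT the CAGE-2 model): one host that is either
# clean (0) or compromised (1).
P_COMPROMISE = 0.2          # assumed chance per step that a clean host is compromised
P_ALERT = {0: 0.1, 1: 0.8}  # assumed chance of an alert given the host state


def transition(state, defender_action, rng):
    """Sample the next state given the current state and the defender action."""
    if defender_action == "restore":
        return 0
    if state == 0 and rng.random() < P_COMPROMISE:
        return 1
    return state


def likelihood(observation, state):
    """Probability of seeing `observation` (True = alert) in `state`."""
    p = P_ALERT[state]
    return p if observation else 1.0 - p


def update_belief(particles, defender_action, observation, rng):
    """One particle-filter step: propagate, weight, resample."""
    propagated = [transition(s, defender_action, rng) for s in particles]
    weights = [likelihood(observation, s) for s in propagated]
    if sum(weights) == 0.0:
        # No particle explains the observation; keep the propagated set.
        return propagated
    return rng.choices(propagated, weights=weights, k=len(particles))


def belief_compromised(particles):
    """Fraction of particles in the compromised state."""
    return sum(particles) / len(particles)


# Usage: start from a belief that the host is clean, then observe three alerts
# while the defender takes a (hypothetical) passive "monitor" action.
rng = random.Random(0)
particles = [0] * 1000
for obs in [True, True, True]:
    particles = update_belief(particles, "monitor", obs, rng)
print(belief_compromised(particles))
```

The particle set stands in for the exact belief distribution, so the cost of one update grows with the number of particles rather than with the size of the state space. In a belief-based learner, a summary of the particles (here, the compromised fraction) would be fed to the policy in place of the raw observation.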

Paper Structure

This paper contains 27 sections, 10 equations, 4 figures, 4 tables, and 3 algorithms.

Figures (4)

  • Figure 1: The network topology of the CAGE-2 scenario
  • Figure 2: The transition of the attacker access state $I_{h,t}$ caused by an action from the attacker $A_t$ or the defender $D_t$; nodes represent the access states; arrows represent the actions that cause state transitions. The defender actions Analyse and Decoy do not change the state $I_{h,t}$ and are therefore not included.
  • Figure 3: Belief Filter Proximal Policy Optimisation (BF-PPO)
  • Figure 4: The learning curves for our solution method, BF-PPO (blue curves), and the baseline, CARDIFF (green curves). Each row corresponds to one CAGE-2 attacker scenario (B-LINE or MEANDER). The left column shows the average cumulative rewards over the training period. The right column enlarges the right half of the graph in the left column. The curves show the mean and the 95% confidence interval for four training runs with different random seeds.