A Scalable Approach to Solving Simulation-Based Network Security Games

Michael Lanier, Yevgeniy Vorobeychik

Abstract

We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition), on which a conventional low-level actor performs focused beam search using a critic agent. Selected candidate actions are evaluated with batched critic forward passes and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than state-of-the-art baselines on large network topologies, without significant scaling issues in memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
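
The caching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `QCache`, the bin width `step`, the key layout `(bins, device, action)`, and the helper `k_hop_neighborhood` are all hypothetical choices made for illustration. The sketch assumes the state projection is a short tuple of floats, that quantization means rounding each coordinate to a fixed grid, and that "conservative k-hop invalidation" means dropping every cached entry whose device lies within k hops of a changed device.

```python
from collections import OrderedDict, deque


class QCache:
    """LRU cache of critic values keyed by (quantized projection, device, action).

    Hypothetical illustration; names and key layout are assumptions.
    """

    def __init__(self, capacity=1024, step=0.1):
        self.capacity = capacity
        self.step = step  # assumed quantization bin width
        self._data = OrderedDict()

    def _key(self, projection, device, action):
        # Nearby projections fall into the same bins and share one entry.
        bins = tuple(int(round(x / self.step)) for x in projection)
        return (bins, device, action)

    def get(self, projection, device, action):
        key = self._key(projection, device, action)
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, projection, device, action, q_value):
        key = self._key(projection, device, action)
        self._data[key] = q_value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

    def invalidate(self, devices):
        """Drop every entry whose device is in `devices`; return the count."""
        devices = set(devices)
        stale = [k for k in self._data if k[1] in devices]
        for k in stale:
            del self._data[k]
        return len(stale)


def k_hop_neighborhood(adjacency, source, k):
    """Breadth-first search: all nodes within k hops of source (inclusive)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nbr in adjacency.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen
```

Under these assumptions, a state change at device `d` would be handled by `cache.invalidate(k_hop_neighborhood(adjacency, d, k))`, which trades some extra critic evaluations for the guarantee that no cached value near the change is reused.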

Paper Structure (18 sections, 4 theorems, 38 equations, 3 figures, 5 tables)

This paper contains 18 sections, 4 theorems, 38 equations, 3 figures, 5 tables.

Key Result

Lemma 1

Suppose a stationary policy $\pi$ satisfies Then its value function $V^\pi$ obeys

Figures (3)

  • Figure 1: High-level view of MetaDOAR. The meta-controller scores devices and selects a small subset $\mathcal{K}$; DOAR then learns a best response while its action decoding is restricted to $\mathcal{K}$.
  • Figure 2: Expected player payoff over Double Oracle iterations for the 1000-device setting. The solid line shows the mean equilibrium payoff per device, averaged over three random seeds. The shaded region denotes a one-standard-error confidence band ($\pm 1$ SE) across seeds.
  • Figure 3: Scalability of MetaDOAR (Ours) vs. baselines across 10, 100, 10000, and 20000 devices for forward-pass latency and peak memory usage. MetaDOAR (Ours) is highlighted in the legend of each plot. Smaller is better.
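
The selection step in the Figure 1 caption (score devices, keep a small subset $\mathcal{K}$, restrict action decoding to it) can be sketched as follows. This is an illustrative Python sketch, not the paper's code: the function names `select_top_k` and `restrict_actions` are hypothetical, the scores are assumed to be a plain list with one float per device, and candidate actions are assumed to be `(device, action)` pairs.

```python
import heapq


def select_top_k(scores, k):
    """Return the sorted indices of the k highest-scoring devices."""
    k = min(k, len(scores))
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


def restrict_actions(candidates, selected):
    """Keep only (device, action) candidates whose device was selected."""
    allowed = set(selected)
    return [(d, a) for d, a in candidates if d in allowed]


# Example with four devices and k = 2 (all values are made up).
scores = [0.1, 0.9, 0.3, 0.7]
subset = select_top_k(scores, 2)
candidates = [(0, "scan"), (1, "patch"), (3, "isolate")]
pruned = restrict_actions(candidates, subset)
```

Using `heapq.nlargest` keeps the selection at roughly O(n log k) rather than a full sort, which matters when the number of devices is large and k is small.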

Theorems & Definitions (7)

  • Lemma 1
  • Theorem 1: MetaDOAR yields an $\varepsilon$-best response
  • Definition 1: MetaDOAR pruning
  • Lemma 2: From $Q^\star$-gaps to value loss
  • proof
  • Theorem 2: MetaDOAR as an $\varepsilon$-best response to a fixed mixture
  • proof