LLM-Assisted AHP for Explainable Cyber Range Evaluation
Vyron Kampourakis, Georgios Kavallieratos, Georgios Spathoulas, Vasileios Gkioulos, Sokratis Katsikas
TL;DR
This paper addresses the lack of standardized, quantitative evaluation for Cyber Range (CR) platforms in Critical Infrastructure (CI) contexts. It proposes a CI-specific evaluation framework that combines Analytic Hierarchy Process (AHP) weighting with a simulated expert panel driven by a Large Language Model (LLM) to produce explainable, reproducible CR scores across multiple criteria. Ten CI-tailored criteria are defined and weighted, and the framework is demonstrated on two representative CRs, revealing trade-offs between realism, scalability, flexibility, and cost. The approach offers a foundation for objective, comparable CR assessments to guide providers and end-users, and the paper also discusses its limitations and avenues for validation and improvement.
Abstract
Cyber Ranges (CRs) have emerged as prominent platforms for cybersecurity training and education, especially for Critical Infrastructure (CI) sectors that face rising cyber threats. One way to address these threats is through hands-on exercises that bridge IT and OT domains to improve defensive readiness. However, consistently evaluating whether a CR platform is suitable and effective remains a challenge. This paper proposes an evaluation framework for CRs that targets mission-critical settings and uses a multi-criteria decision-making approach. We define a set of evaluation criteria that capture technical fidelity, training and assessment capabilities, scalability, usability, and other relevant factors. To weight and aggregate these criteria, we employ the Analytic Hierarchy Process (AHP), supported by a simulated panel of multidisciplinary experts implemented through a Large Language Model (LLM). This LLM-assisted expert reasoning enables consistent and reproducible pairwise comparisons across criteria without the need to convene human experts. The framework outputs quantitative scores that facilitate objective comparison of CR platforms and highlight areas for improvement. Overall, this work lays the foundation for a standardized and explainable evaluation methodology to guide both providers and end-users of CRs.
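To make the weighting and aggregation step concrete, the following is a minimal sketch of standard AHP, not the authors' implementation. It derives criterion weights from a pairwise comparison matrix via the principal eigenvector, computes Saaty's consistency ratio, and aggregates per-criterion ratings into a single score. The function names, the three-criterion matrix, and the ratings are hypothetical and chosen only for illustration; in the framework described above, the pairwise judgments would come from the LLM-simulated expert panel and cover ten criteria.

```python
import numpy as np

# Saaty's random consistency index (RI) for matrix sizes 1..10.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def ahp_weights(pairwise):
    """Return (weights, consistency_ratio) for a reciprocal pairwise matrix."""
    a = np.asarray(pairwise, dtype=float)
    n = a.shape[0]
    eigvals, eigvecs = np.linalg.eig(a)
    k = int(np.argmax(eigvals.real))          # index of the principal eigenvalue
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                           # normalize weights to sum to 1
    lambda_max = eigvals[k].real
    ci = (lambda_max - n) / (n - 1) if n > 2 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0           # CR < 0.1 is conventionally acceptable
    return w, cr


def weighted_score(weights, ratings):
    """Aggregate per-criterion ratings into one score (weighted sum)."""
    return float(np.dot(weights, ratings))


# Hypothetical judgments for three illustrative criteria:
# fidelity vs. scalability vs. usability (entry [i][j] = importance of i over j).
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights, cr = ahp_weights(pairwise)

# Hypothetical ratings of one CR platform on the same three criteria (0..1 scale).
score = weighted_score(weights, [0.8, 0.5, 0.6])
```

A consistency ratio above roughly 0.1 would conventionally signal that the pairwise judgments should be revisited before the weights are used.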
