Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents

María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone, Luis Javier Navarrete-Lozano, Cristóbal R. J. Veas Chavez, Maite del Mundo de Torres

TL;DR

CAIBench introduces a modular meta-benchmark for evaluating cybersecurity AI agents across offensive, defensive, knowledge, and privacy domains, addressing the fragmentation of prior benchmarks. By integrating five benchmarks—Jeopardy-style CTFs, Attack-and-Defense CTFs, Cyber Range, knowledge tests, and privacy assessments—CAIBench enables simultaneous offensive and defensive evaluation, including robotics-focused challenges (RCTF2) and GDPR-aligned privacy metrics (CyberPII-Bench). Empirical results show strong knowledge performance but limited practical capability in multi-step, adversarial, and robotics scenarios, with substantial influence from framework choices and task decomposition, revealing a persistent gap between theory and real-world security labor. The work highlights the need for continual benchmark evolution and integrated evaluation pipelines to better align AI capabilities with professional cybersecurity practice, while providing a reproducible framework for researchers and practitioners. The combination of CAIBench and CAI represents a step toward scalable, labor-relevant assessment of cybersecurity AI systems, guiding future research toward more capable and trustworthy autonomous security agents.

Abstract

Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained knowledge of cybersecurity in LLMs does not imply attack and defense abilities, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework for evaluating LLMs and agents across offensive and defensive cybersecurity domains, taking a step towards meaningfully measuring their labor-relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous offensive-defensive evaluation, robotics-focused cybersecurity challenges (RCTF2), and privacy-preserving performance assessment (CyberPII-Bench). Evaluation of state-of-the-art AI models reveals saturation on security knowledge metrics (~70% success) but substantial degradation in multi-step adversarial (A&D) scenarios (20-40% success), or worse on robotic targets (22% success). The combination of framework scaffolding and LLM choice significantly impacts performance; we find up to 2.6× variance in Attack and Defense CTF results depending on how well the two are matched. These results demonstrate a pronounced gap between conceptual knowledge and adaptive capability, emphasizing the need for a meta-benchmark.
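
To make the size of the reported gap concrete, the following minimal sketch computes how far each practical success rate falls below the knowledge score. It uses only the approximate figures quoted in the abstract; the names `KNOWLEDGE`, `PRACTICAL`, `gap_to_knowledge`, and `summarize` are illustrative assumptions and are not part of CAIBench.

```python
# Approximate success rates quoted in the abstract, as fractions.
# All names in this sketch are hypothetical and only serve the illustration.
KNOWLEDGE = 0.70
PRACTICAL = {
    "A&D (lower bound)": 0.20,
    "A&D (upper bound)": 0.40,
    "robotic targets": 0.22,
}


def gap_to_knowledge(rate: float, knowledge: float = KNOWLEDGE) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative drop as a fraction)."""
    absolute_points = (knowledge - rate) * 100
    relative_drop = (knowledge - rate) / knowledge
    return absolute_points, relative_drop


def summarize() -> dict[str, tuple[float, float]]:
    """Compute the gap for every practical scenario listed above."""
    return {name: gap_to_knowledge(rate) for name, rate in PRACTICAL.items()}


if __name__ == "__main__":
    for name, (points, drop) in summarize().items():
        print(f"{name}: {points:.0f} points below knowledge ({drop:.0%} relative drop)")
```

Under these figures, even the upper bound for A&D scenarios sits 30 percentage points below the knowledge score.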

Paper Structure

This paper contains 45 sections, 4 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: CAIBench categories: A meta-benchmark integrating five categories for cybersecurity evaluation.
  • Figure 2: Architecture of the CAIBench Meta-benchmark Framework. The framework is organized into three main branches: Categories, Difficulty, and Infrastructure. The Categories branch includes multiple benchmarks (Jeopardy CTF, A&D CTF, Cyber Range, Knowledge Bench, Privacy Bench). The Difficulty branch groups benchmarks by skill level, while the Infrastructure branch distinguishes between Docker-based and scripted implementations. Each benchmark is annotated with its infrastructure type and the number of instances or questions it contains, providing a detailed overview of the framework's composition.
  • Figure 5: Overall benchmark results across key cybersecurity categories: (a) privacy, (b) knowledge, (c) Jeopardy CTF, (d) Attack and Defense scenarios, and (e) Cyber Range CTF. For this overview, precision and model are the metrics considered for privacy and A&D; other subcategories and metrics are omitted for clarity. The remaining values are averages of the detailed results reported in Table \ref{tab:combined-benchmarks}. Overall, models excel at knowledge (70–89%) but fail at execution (20–50%).
  • Figure 6: Heatmap benchmarking CAI across LLMs on the Base benchmark with 23 selected challenges. The heatmap shows the performance of different Large Language Models (LLMs) on the Base CTF Benchmark (\ref{anex:base_challenges}), measured with the $pass_{100}@1$ metric and run in a Kali Linux (Rolling) environment. Basic CTFs have reached saturation.
  • Figure 7: Heatmap benchmarking CAI across LLMs on Cybench: model performance vs. Cybench CTF challenges. The heatmap shows the performance of different models on the Cybench Benchmark (\ref{anex:cybench_challenges}), measured with the $pass_{100}@1$ metric and run in a Kali Linux (Rolling) environment. Performance drops from 75% on basics to 46% on complex attacks.
  • ...and 11 more figures