The Digital Cybersecurity Expert: How Far Have We Come?

Dawei Wang, Geng Zhou, Xianglong Li, Yu Bai, Li Chen, Ting Qin, Jian Sun, Dan Li

TL;DR

This work introduces CSEBenchmark, a cognitive science-based cybersecurity knowledge framework comprising 345 fine-grained knowledge points across seven subdomains, compiled into 11,050 high-quality MCQs. It systematically evaluates 12 state-of-the-art LLMs, revealing that no model yet reaches full expert proficiency (top accuracy = 85.42%), with pronounced gaps in procedural knowledge and in the use of specialized tools. By identifying model-specific knowledge gaps and applying Retrieval-Augmented Generation (RAG) to inject relevant knowledge during inference, the authors achieve up to 84% correction of previously erroneous predictions on three cybersecurity benchmarks, demonstrating the practical value of targeted knowledge augmentation. The study also assesses alignment with six real-world cybersecurity job roles, finding partial, role-dependent matches and underscoring the need for role-specific model selection and further improvements before LLMs can reliably assume expert cybersecurity duties.

Abstract

The increasing deployment of large language models (LLMs) in the cybersecurity domain underscores the need for effective model selection and evaluation. However, traditional evaluation methods often overlook specific cybersecurity knowledge gaps that contribute to performance limitations. To address this, we develop CSEBenchmark, a fine-grained cybersecurity evaluation framework based on 345 knowledge points expected of cybersecurity experts. Drawing from cognitive science, these points are categorized into factual, conceptual, and procedural types, enabling the design of 11,050 tailored multiple-choice questions. We evaluate 12 popular LLMs on CSEBenchmark and find that even the best-performing model achieves only 85.42% overall accuracy, with particular knowledge gaps in the use of specialized tools and uncommon commands. Different LLMs have unique knowledge gaps. Even large models from the same family may perform poorly on knowledge points where smaller models excel. By identifying and addressing specific knowledge gaps in each LLM, we achieve up to an 84% improvement in correcting previously incorrect predictions across three existing benchmarks for two cybersecurity tasks. Furthermore, our assessment of each LLM's knowledge alignment with specific cybersecurity roles reveals that different models align better with different roles, such as GPT-4o for the Google Senior Intelligence Analyst and Deepseek-V3 for the Amazon Privacy Engineer. These findings underscore the importance of aligning LLM selection with the specific knowledge requirements of different cybersecurity roles for optimal performance.
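The two quantities the abstract relies on, per-knowledge-point accuracy and the share of previously incorrect predictions that get corrected, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `kp_*` labels, the 0.6 gap threshold, and the reading of the "up to 84%" figure as "fraction of previously wrong answers that become right" are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_knowledge_point(records):
    """Return {knowledge_point: accuracy} from (knowledge_point, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for knowledge_point, is_correct in records:
        totals[knowledge_point] += 1
        hits[knowledge_point] += int(is_correct)
    return {kp: hits[kp] / totals[kp] for kp in totals}


def knowledge_gaps(accuracy, threshold=0.6):
    """List knowledge points whose accuracy falls below a (hypothetical) threshold."""
    return sorted(kp for kp, acc in accuracy.items() if acc < threshold)


def correction_rate(before, after):
    """Fraction of items answered wrongly before augmentation that are right after it."""
    wrong = [i for i, ok in enumerate(before) if not ok]
    if not wrong:
        return 0.0
    return sum(1 for i in wrong if after[i]) / len(wrong)


# Hypothetical MCQ results for one model: (knowledge point, answered correctly?)
records = [
    ("kp_a", True), ("kp_a", False),
    ("kp_b", False), ("kp_b", False),
    ("kp_c", True),
]
accuracy = accuracy_by_knowledge_point(records)
gaps = knowledge_gaps(accuracy)

# Hypothetical per-question correctness on a downstream benchmark,
# before and after injecting knowledge for the identified gaps.
before = [True, False, False, False, True]
after = [True, True, False, True, True]
rate = correction_rate(before, after)
```

With this toy data, `gaps` contains `kp_a` and `kp_b`, and two of the three previously wrong answers are corrected. Scoring each knowledge point separately is what lets gaps be identified per model rather than read off a single overall accuracy.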

Paper Structure

This paper contains 28 sections, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: Overview of the construction process of CSEBenchmark.
  • Figure 2: Accuracy distribution of LLMs across subdomains.
  • Figure 3: Accuracy distribution of LLMs across knowledge categories.
  • Figure 4: Heatmap of accuracy across 345 knowledge points for 12 models. The y-axis labels denote individual knowledge points, with subdomain names in parentheses for grouped items. Each section contains 12 columns representing models from left to right: GPT-4o, Deepseek-V3, Qwen-2.5-72B, GPT-4-Turbo, Deepseek-R1, Llama-3.1-70B, Qwen-2.5-7B, Mixtral-8x7B, GPT-3.5-Turbo, Llama-3.1-8B, Qwen-2.5-3B, Llama-3.2-3B.
  • Figure 5: Proportion of knowledge points across four accuracy ranges for each LLM.
  • ...and 4 more figures