CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity

Zhengmin Yu, Jiutian Zeng, Siyi Chen, Wenhan Xu, Dandan Xu, Xiangyu Liu, Zonghao Ying, Nan Wang, Yuan Zhang, Min Yang

TL;DR

CS-Eval introduces the first open-access bilingual benchmark for cybersecurity LLMs, spanning 11 categories and 42 subcategories with 4,369 questions across knowledge, ability, and application levels. It combines expert-curated static questions with dynamic data generation to stay current and mitigate data contamination, enabling nuanced, domain-specific evaluation. Across a broad model suite, GPT-4 8K leads overall, but domain-specialized models and smaller models trained on high-quality data can excel in specific tasks, underscoring the importance of data quality and targeted training. The findings highlight the role of scaling laws alongside data quality, demonstrate the growing parity between open-source and proprietary models, and offer practical guidance for future cybersecurity LLM development and benchmarking.

Abstract

Over the past year, there has been a notable rise in the use of large language models (LLMs) for academic research and industrial practice within the cybersecurity field. However, there remains a lack of comprehensive and publicly accessible benchmarks to evaluate the performance of LLMs on cybersecurity tasks. To address this gap, we introduce CS-Eval, a publicly accessible, comprehensive, and bilingual LLM benchmark specifically designed for cybersecurity. CS-Eval synthesizes research hotspots from academia and practical applications from industry, curating a diverse set of high-quality questions across 42 categories within cybersecurity, systematically organized into three cognitive levels: knowledge, ability, and application. Through an extensive evaluation of a wide range of LLMs using CS-Eval, we have uncovered valuable insights. For instance, while GPT-4 generally excels overall, other models may outperform it in certain specific subcategories. Additionally, by conducting evaluations over several months, we observed significant improvements in many LLMs' abilities to solve cybersecurity tasks. The benchmarks are now publicly available at https://github.com/CS-EVAL/CS-Eval.

Paper Structure

This paper contains 25 sections, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: Overview of Fields Covered by CS-Eval: A comprehensive cybersecurity benchmark encompassing 11 categories and 42 subcategories across various domains.
  • Figure 2: CS-Eval Data Collection Process
  • Figure 3: The Average Scores of Models with Different Parameter Sizes.
  • Figure 4: Performance Scores Across Different Generations of Models.
  • Figure 5: Self-instruct prompt used for generating Chinese questions.