SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

Pengfei Jing, Mengyun Tang, Xiaorong Shi, Xing Zheng, Sen Nie, Shi Wu, Yong Yang, Xiapu Luo

TL;DR

SecBench tackles the lack of domain-specific LLM benchmarks for cybersecurity by introducing a large-scale, multi-dimensional dataset that features MCQs and SAQs across two capability levels (Knowledge Retention, KR, and Logical Reasoning, LR), two languages (Chinese and English), and nine cybersecurity domains. It combines open-source data with a Cybersecurity Question Design Contest, yielding 44,823 MCQs and 3,087 SAQs, and uses GPT-4 for labeling and GPT-4o-mini as an automatic SAQ grader. Benchmarking across 16 state-of-the-art LLMs demonstrates the dataset’s utility and scale, with findings that KR questions are generally easier than LR questions and that larger models achieve higher accuracy, notably Hunyuan-Turbo in MCQs. SecBench thus provides a robust, domain-focused platform for evaluating and improving cybersecurity-oriented LLMs, with potential to guide domain-specific alignment and future multilingual expansion.

Abstract

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and HumanEval assess general LLM performance but lack focus on specific expert domains such as cybersecurity. Previous attempts to create cybersecurity datasets have faced limitations, including insufficient data volume and a reliance on multiple-choice questions (MCQs). To address these gaps, we propose SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in the cybersecurity domain. SecBench includes questions in various formats (MCQs and short-answer questions (SAQs)), at different capability levels (Knowledge Retention and Logical Reasoning), in multiple languages (Chinese and English), and across various sub-domains. The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. In particular, we used powerful yet cost-effective LLMs to (1) label the data and (2) construct a grading agent for automatic evaluation of SAQs. Benchmarking results on 16 SOTA LLMs demonstrate the usability of SecBench, which is arguably the largest and most comprehensive benchmark dataset for LLMs in cybersecurity. More information about SecBench can be found at our website, and the dataset can be accessed via the artifact link.
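
To make the multi-dimensional design concrete, the following is a minimal sketch of how an MCQ record carrying the level, language, and domain labels might be represented, and how accuracy could be broken down along any one of those dimensions. The field names (`level`, `language`, `domain`), the label values, and the `accuracy_by` helper are assumptions made for illustration; they are not taken from the SecBench release.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question with hypothetical dimension labels."""
    question: str
    choices: Tuple[str, ...]
    answer: str    # correct option letter, e.g. "B" (single-answer only here)
    level: str     # assumed labels: "KR" or "LR"
    language: str  # assumed labels: "zh" or "en"
    domain: str    # assumed free-form sub-domain label


def accuracy_by(items: Sequence[MCQ], predictions: Sequence[str], key: str) -> Dict[str, float]:
    """Group questions by one dimension and return per-group accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        group = getattr(item, key)
        total[group] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        correct[group] += int(pred.strip().upper() == item.answer)
    return {group: correct[group] / total[group] for group in total}
```

Calling `accuracy_by(items, predictions, "level")` would give one accuracy figure per capability level, which is the kind of comparison behind the observation that Knowledge Retention questions tend to be easier than Logical Reasoning questions.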

Paper Structure

This paper contains 14 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: SecBench: A multi-level, multi-language, multi-form, and multi-domain benchmarking dataset for LLMs in cybersecurity.
  • Figure 2: SecBench: Dataset Construction.
  • Figure 3: The distribution of evaluation level, domain and language of the 44,823 MCQs.
  • Figure 4: The distribution of evaluation level, domain and language of the 3,087 SAQs.
  • Figure 5: SAQ evaluation process: A sufficiently powerful LLM is used as the agent to grade the model prediction.
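
The SAQ evaluation process in Figure 5 can be sketched roughly as follows: a grader LLM receives the question, the reference answer, and the model prediction, and returns a score. This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt wording, the 0 to 10 score range, and the names `grade_saq` and `stub_grader` are all hypothetical, and the grader is passed in as a plain callable so that no particular LLM API is assumed.

```python
import re
from typing import Callable

# Hypothetical grading prompt; the actual SecBench prompt is not shown in this text.
GRADING_PROMPT = (
    "You are grading a short-answer cybersecurity question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Reply with a single integer score from 0 to {max_score}."
)


def grade_saq(
    question: str,
    reference: str,
    prediction: str,
    grader: Callable[[str], str],
    max_score: int = 10,  # assumed scale
) -> int:
    """Ask the grader for a score and parse the first integer in its reply."""
    prompt = GRADING_PROMPT.format(
        question=question,
        reference=reference,
        prediction=prediction,
        max_score=max_score,
    )
    reply = grader(prompt)
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"grader reply contains no score: {reply!r}")
    # Clamp to the allowed range in case the grader overshoots.
    return min(int(match.group()), max_score)


def stub_grader(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to exercise the parsing logic."""
    return "Score: 7"
```

In practice `stub_grader` would be replaced by a function that sends the prompt to the grading model (GPT-4o-mini in the paper) and returns its text reply. Keeping the grader behind a callable makes it easy to swap models or to test the scoring logic offline.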