SECURE: Benchmarking Large Language Models for Cybersecurity

Dipkamal Bhusal; Md Tanvirul Alam; Le Nguyen; Ashim Mahara; Zachary Lightcap; Rodney Frazier; Romy Fieblinger; Grace Long Torales; Benjamin A. Blakely; Nidhi Rastogi

TL;DR

This study introduces SECURE, a benchmark designed to assess LLM performance in realistic cybersecurity scenarios, and evaluates seven state-of-the-art models on its tasks, providing insights into their strengths and weaknesses in cybersecurity contexts.

Abstract

Large Language Models (LLMs) have demonstrated potential in cybersecurity applications, but problems such as hallucinations and a lack of truthfulness have lowered confidence in them. Existing benchmarks provide general evaluations but do not sufficiently address the practical and applied aspects of LLM performance in cybersecurity-specific tasks. To address this gap, we introduce SECURE (Security Extraction, Understanding & Reasoning Evaluation), a benchmark designed to assess LLM performance in realistic cybersecurity scenarios. SECURE includes six datasets focussed on the Industrial Control System sector to evaluate knowledge extraction, understanding, and reasoning based on industry-standard sources. Our study evaluates seven state-of-the-art models on these tasks, provides insights into their strengths and weaknesses in cybersecurity contexts, and offers recommendations for improving LLM reliability as cyber advisory tools.

Paper Structure (39 sections, 11 figures, 8 tables)

Introduction
Background and Related Work
Proposed Benchmark: SECURE
Modeling
Data Sources and Tasks
Extraction Task
Understanding Task
Reasoning Task
Dataset Validation
Benchmark Dataset and Evaluation
Experiments & Results
Prompting Strategy for Evaluation
Evaluation Metrics
Results Summary
Discussion & Analysis
...and 24 more sections

Figures (11)

Figure 1: Conversation between a user and ChatGPT-3.5 on ICS security questions.
Figure 2: Overview of the SECURE benchmark.
Figure 3: Confidence vs. Accuracy of LLMs
Figure 4: Performance Comparison of Open-Source and Closed-Source LLMs Across Different Tasks (Note: For CPST, lower scores indicate better performance).
Figure 5: Performance of LLMs when asked to perform step-by-step analysis
...and 6 more figures
