SECURE: Benchmarking Large Language Models for Cybersecurity

Dipkamal Bhusal, Md Tanvirul Alam, Le Nguyen, Ashim Mahara, Zachary Lightcap, Rodney Frazier, Romy Fieblinger, Grace Long Torales, Benjamin A. Blakely, Nidhi Rastogi

TL;DR

This study introduces SECURE, a benchmark designed to assess LLM performance in realistic cybersecurity scenarios, and evaluates seven state-of-the-art models on its tasks, providing insights into their strengths and weaknesses in cybersecurity contexts.

Abstract

Large Language Models (LLMs) have demonstrated potential in cybersecurity applications, but problems such as hallucinations and a lack of truthfulness have lowered confidence in them. Existing benchmarks provide general evaluations but do not sufficiently address the practical and applied aspects of LLM performance in cybersecurity-specific tasks. To address this gap, we introduce SECURE (Security Extraction, Understanding & Reasoning Evaluation), a benchmark designed to assess LLM performance in realistic cybersecurity scenarios. SECURE includes six datasets focused on the Industrial Control System sector to evaluate knowledge extraction, understanding, and reasoning based on industry-standard sources. Our study evaluates seven state-of-the-art models on these tasks, provides insights into their strengths and weaknesses in cybersecurity contexts, and offers recommendations for improving LLM reliability as cyber advisory tools.

Paper Structure

This paper contains 39 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Conversation between a user and ChatGPT-3.5 on ICS security questions.
  • Figure 2: Overview of the SECURE benchmark.
  • Figure 3: Confidence vs. Accuracy of LLMs.
  • Figure 4: Performance Comparison of Open-Source and Closed-Source LLMs Across Different Tasks (Note: For CPST, lower scores indicate better performance).
  • Figure 5: Performance of LLMs when asked to perform step-by-step analysis.
  • ...and 6 more figures