CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

Lauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, Brad Moon, Bruno Ostarcevic, Alberto Peltea, Matt Rosenberg, Catalin Sandu, Arthur Saputkin, Sagar Shah, Daniel Stan, Ernest Szocs, Shengye Wan, Spencer Whitman, Sven Krasser, Joshua Saxe

TL;DR

CyberSOCEval presents the first open source benchmark suite targeted at SOC-relevant reasoning by LLMs, focusing on Malware Analysis and Threat Intelligence Reasoning. The study demonstrates that larger, modern LLMs generally perform better in these domains, but there is no consistent, large boost from inference-time reasoning within cybersecurity tasks, indicating a need for domain-specific training and data. The benchmarks are designed for automation, with carefully generated, expert-validated data, multimodal threat reports, and extensive baselines, revealing a meaningful gap between current models and practical cyber defense capabilities. Overall, CyberSOCEval provides a practical, extensible framework to guide model development and tool adoption in AI-augmented security operations.

Abstract

Today's cyber defenders are overwhelmed by a deluge of security alerts, threat intelligence signals, and shifting business context, creating an urgent need for AI systems to enhance operational security work. While Large Language Models (LLMs) have the potential to automate and scale Security Operations Center (SOC) operations, existing evaluations do not fully assess the scenarios most relevant to real-world defenders. This lack of informed evaluation impacts both AI developers and those applying LLMs to SOC automation. Without clear insight into LLM performance in real-world security scenarios, developers lack a north star for development, and users cannot reliably select the most effective models. Meanwhile, malicious actors are using AI to scale cyber attacks, highlighting the need for open source benchmarks to drive adoption and community-driven improvement among defenders and model developers. To address this, we introduce CyberSOCEval, a new suite of open source benchmarks within CyberSecEval 4. CyberSOCEval includes benchmarks tailored to evaluate LLMs in two tasks: Malware Analysis and Threat Intelligence Reasoning, both core defensive domains with inadequate coverage in current benchmarks. Our evaluations show that larger, more modern LLMs tend to perform better, confirming the training scaling laws paradigm. We also find that reasoning models leveraging test-time scaling do not achieve the same boost as in coding and math, suggesting these models have not been trained to reason about cybersecurity analysis, and pointing to a key opportunity for improvement. Finally, current LLMs are far from saturating our evaluations, showing that CyberSOCEval presents a significant challenge for AI developers to improve cyber defense capabilities.

Paper Structure

This paper contains 27 sections, 10 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: Overview of CyberSOCEval, our proposed benchmark suite for evaluating Large Language Models (LLMs) in Security Operations Center (SOC) tasks. The suite consists of two complementary benchmarks: Malware Analysis and Threat Intelligence Reasoning, which together assess an LLM's ability to support key SOC functions.
  • Figure 2: The steps followed for our Malware Analysis benchmark to generate and refine question-answer pairs using LLMs, and to manually validate them.
  • Figure 3: A notional, shorter example of our Malware Analysis test cases. Our real test cases contain fully detailed sandbox detonation JSON files alongside multiple choice questions that exercise the AI system under test’s Malware Analysis abilities.
  • Figure 4: Distribution of malware families covered by detonation reports included in the Malware Analysis benchmark (left), and distribution of question-answer pairs by topic and difficulty rating (right).
  • Figure 5: The steps followed for our Threat Intelligence benchmark to generate question-answer pairs using LLMs, and to manually validate them.
  • ...and 17 more figures