OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities

Michael Kouremetis, Marissa Dotter, Alex Byrne, Dan Martin, Ethan Michalak, Gianpaolo Russo, Michael Threet, Guido Zarrella

TL;DR

OCCULT is a standardized, open-source framework for evaluating the offensive cyber operations (OCO) capabilities of large language models (LLMs), motivated by the risk that LLMs could be used to scale such operations. It introduces three benchmarks (TACTL, BloodHound Equivalency, and CyberLayer) within an LLM Evaluation Platform that includes a Leaderboard and rapid test integration. Preliminary results show that newer models (e.g., DeepSeek-R1) can solve challenging offensive cyber knowledge tests with high accuracy, though with trade-offs such as longer inference times and performance that varies across benchmarks. The paper argues for community-driven expansion of test cases and for standardization, both to keep pace with rapid LLM development and to help cyber defenders understand potential risks and their defensive implications.

Abstract

The prospect of artificial intelligence (AI) competing in the adversarial landscape of cyber security has long been considered one of the most impactful, challenging, and potentially dangerous applications of AI. Here, we demonstrate a new approach to assessing AI's progress towards enabling and scaling real-world offensive cyber operations (OCO) tactics in use by modern threat actors. We detail OCCULT, a lightweight operational evaluation framework that allows cyber security experts to contribute to rigorous and repeatable measurement of the plausible cyber security risks associated with any given large language model (LLM) or AI employed for OCO. We also prototype and evaluate three very different OCO benchmarks for LLMs that demonstrate our approach and serve as examples for building benchmarks under the OCCULT framework. Finally, we provide preliminary evaluation results to demonstrate how this framework allows us to move beyond traditional all-or-nothing tests, such as those crafted from educational exercises like capture-the-flag environments, to contextualize our indicators and warnings in true cyber threat scenarios that present risks to modern infrastructure. We find that there has been significant recent advancement in the risks of AI being used to scale realistic cyber threats. For the first time, we find a model (DeepSeek-R1) that is capable of correctly answering over 90% of the challenging offensive cyber knowledge tests in our Threat Actor Competency Test for LLMs (TACTL) multiple-choice benchmarks. We also show that Meta's Llama and Mistral's Mixtral model families achieve marked performance improvements over earlier models on our benchmarks in which LLMs act as offensive agents in MITRE's high-fidelity offensive and defensive cyber operations simulation environment, CyberLayer.

Paper Structure

This paper contains 61 sections, 22 figures, 11 tables.

Figures (22)

  • Figure 1: Conceptual view of the OCCULT LLM Evaluation Methodology for OCO.
  • Figure 2: Conceptual view of the OCCULT LLM Evaluation Benchmark.
  • Figure 3: Sample TACTL question without variables reconciled (i.e. filled). The bold text represents the variables found in the question-and-answer options. The cyan highlighted cell represents the correct answer option, and the green highlighted cells represent the incorrect answer options.
  • Figure 4: Sample TACTL question with variables reconciled (i.e. filled). The bold text represents the variables found in the question-and-answer options. The cyan highlighted cell represents the correct answer option, and the green highlighted cells represent the incorrect answer options.
  • Figure 5: Example query with LLM response and BloodHound response.
  • ...and 17 more figures