Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Andrey Anurin, Jonathan Ng, Kibo Schaffer, Jason Schreiber, Esben Kran

TL;DR

The paper introduces the Catastrophic Cyber Capabilities Benchmark (3CB), a framework to rigorously assess the real-world offensive cyber capabilities of LLM agents using MITRE ATT&CK-aligned challenges. It combines a reusable harness with a 15-task challenge set to enable reproducible, sandboxed evaluations across frontier and open-source models, and it shows that frontier model performance is both substantial and sensitive to how the model is elicited. The study finds that models like GPT-4o and Claude 3.5 Sonnet can autonomously perform complex offensive tasks under suitable elicitation, while smaller models show limited capabilities, underscoring safety and regulatory concerns. By open-sourcing 3CB, the authors aim to bridge the gap between rapidly advancing AI capabilities and robust risk assessment, informing researchers, developers, and policymakers on mitigation strategies and responsible deployment.

Abstract

LLM agents have the potential to revolutionize defensive cyber operations, but their offensive capabilities are not yet fully understood. To prepare for emerging threats, model developers and governments are evaluating the cyber capabilities of foundation models. However, these assessments often lack transparency and a comprehensive focus on offensive capabilities. In response, we introduce the Catastrophic Cyber Capabilities Benchmark (3CB), a novel framework designed to rigorously assess the real-world offensive capabilities of LLM agents. Our evaluation of modern LLMs on 3CB reveals that frontier models, such as GPT-4o and Claude 3.5 Sonnet, can perform offensive tasks such as reconnaissance and exploitation across domains ranging from binary analysis to web technologies. Conversely, smaller open-source models exhibit limited offensive capabilities. Our software solution and the corresponding benchmark provide a critical tool to reduce the gap between rapidly improving capabilities and the robustness of cyber offense evaluations, aiding in the safer deployment and regulation of these powerful technologies.

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Our sshhijack challenge is designed from technique T1563 in the 10th step of the ATT&CK categorization, 'Lateral Movement'. On the right, run 33952 is finishing after our agent configuration (GPT-4o using Markdown in this case) reveals the flag in the terminal from a remote service.
  • Figure 2: An overview of how many challenges out of 15 each model was able to complete from our 3CB Benchmark. Note that the o1 family models display limited performance due to aggressive safety filtering.
  • Figure 3: Our challenges, each based on a MITRE ATT&CK sub-technique, consist of a specific cyber offense task within a controlled sandbox environment, where the AI agent must find a hidden "flag" (a secret string) by successfully executing the relevant technique.
  • Figure 4: The rate of completion for the best elicitation for all models across all challenges. Each combination is run 10 to 228 times, with an average of 37 and a median of 20. See Table 1 for a description of each challenge. $\sim$ indicates combinations of challenges and models that have not been tested. Look through these challenge runs at https://cybercapabilities.org.
  • Figure 5: Completion rate by the agent's communication protocol to formulate commands for the environment. There is no straightforward reason why some models have large differences and some do not. X* and M* mark pairwise significance compared to XML and Markdown, respectively.
  • ...and 1 more figure