CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Shengye Wan; Cyrus Nikolaidis; Daniel Song; David Molnar; James Crnkovich; Jayson Grace; Manish Bhatt; Sahana Chennabasappa; Spencer Whitman; Stephanie Ding; Vlad Ionescu; Yue Li; Joshua Saxe

TL;DR

CyberSecEval 3 advances the empirical measurement of LLM cybersecurity risks by assessing eight risks across two categories (risks to third parties, and risks to application developers and end users) and by extending evaluation to offensive capabilities: automated social engineering, uplift of manual cyber operations, and autonomous cyber operations. The paper describes its assessment methods, including spear-phishing simulations, two-stage uplift experiments, and autonomous attack trials, and presents guardrails (Prompt Guard, Code Shield, Llama Guard 3) that mitigate many of the identified risks. Key findings: Llama 3 405B can perform moderately persuasive spear-phishing and can contribute to vulnerability exploitation tasks, but its autonomous capabilities remain limited; the guardrails substantially reduce risk, though no defense is perfect. The authors release the guardrails publicly and discuss limitations and future work aimed at continued risk assessment as models evolve, in support of safer deployment of LLMs in security-sensitive contexts.

Abstract

We are releasing a new suite of security benchmarks for LLMs, CYBERSECEVAL 3, to continue the conversation on empirically measuring LLM cybersecurity risks and capabilities. CYBERSECEVAL 3 assesses 8 different risks across two broad categories: risk to third parties, and risk to application developers and end users. Compared to previous work, we add new areas focused on offensive security capabilities: automated social engineering, scaling manual offensive cyber operations, and autonomous offensive cyber operations. In this paper we discuss applying these benchmarks to the Llama 3 models and a suite of contemporaneous state-of-the-art LLMs, enabling us to contextualize risks both with and without mitigations in place.

Paper Structure (74 sections, 20 figures, 7 tables)

Introduction
Summary of Findings
Paper structure
Related Work
On which risks to evaluate for new models
Assessment of risks to third parties
Assessment of risks to application developers
Assessment of offensive cybersecurity capabilities and risks to third parties
Risk: Automated social engineering via spear-phishing
Assessment strategy
Phishing simulation procedure
Manual grading rubric
Assessed risk
Risk: Scaling Manual Offensive Cyber Operations
Assessment strategy
...and 59 more sections

Figures (20)

Figure 1: Overview of risks evaluated, evaluation approach, our limitations, and our results in evaluating Llama 3 with CyberSecEval. We have publicly released all non-manual evaluation elements within CyberSecEval for transparency, reproducibility, and to encourage community contributions. We also publicly release all mentioned LLM guardrails, including CodeShield, PromptGuard, and LlamaGuard 3.
Figure 2: An example dialogue from our automated social engineering evaluation, between Llama 3 405B and an LLM-simulated phishing victim, in which the Llama 3 405B attacker reasons about the simulated victim's personal attributes to execute a strategy to persuade them to download and open a malicious attachment. We have added the highlight to the text at the bottom for emphasis.
Figure 3: Results from our automated social engineering evaluation. GPT-4 Turbo was evaluated by the judge LLM to be significantly more successful at achieving spear-phishing goals than Llama 3 405B and Mixtral 8x22B.
Figure 4: Automated scoring results per goal from our automated social engineering evaluation approach, showing success rate per model and persuasion goal. Higher values give evidence of stronger social engineering capabilities.
Figure 5: An example interaction between a human participant and Llama 3 405B during our offensive security uplift study.
...and 15 more figures
