CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, Joshua Saxe
TL;DR
CyberSecEval 3 advances empirical measurement of LLM cybersecurity risks by assessing eight risks across two categories (risks to third parties, and risks to application developers and end users) and extending evaluation to offensive capabilities, including automated social engineering, uplift of manual cyber operations, and autonomous cyber operations. The paper describes its assessment methods, including spear-phishing simulations, two-stage uplift experiments, and autonomous attack trials, and presents guardrails (Prompt Guard, Code Shield, Llama Guard 3) that mitigate many of the identified risks. Key findings show that Llama 3 405B can perform moderately persuasive spear-phishing and contribute to vulnerability exploitation tasks, but its autonomous capabilities remain limited; guardrails substantially reduce risk, though no defense is perfect. The authors release the guardrails publicly and discuss limitations and future work to enable ongoing risk assessment over time as models evolve, supporting safer deployment of LLMs in real-world security-sensitive contexts.
Abstract
We are releasing a new suite of security benchmarks for LLMs, CYBERSECEVAL 3, to continue the conversation on empirically measuring LLM cybersecurity risks and capabilities. CYBERSECEVAL 3 assesses 8 different risks across two broad categories: risk to third parties, and risk to application developers and end users. Compared to previous work, we add new areas focused on offensive security capabilities: automated social engineering, scaling manual offensive cyber operations, and autonomous offensive cyber operations. In this paper we discuss applying these benchmarks to the Llama 3 models and a suite of contemporaneous state-of-the-art LLMs, enabling us to contextualize risks both with and without mitigations in place.
