CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Manish Bhatt; Sahana Chennabasappa; Yue Li; Cyrus Nikolaidis; Daniel Song; Shengye Wan; Faizan Ahmad; Cornelius Aschermann; Yaohui Chen; Dhaval Kapil; David Molnar; Spencer Whitman; Joshua Saxe

TL;DR

CyberSecEval 2 presents a comprehensive, open-source benchmark to quantify large language model security risks and capabilities, introducing prompt injection and code interpreter abuse as new evaluation axes. It formalizes the safety-utility tradeoff through the False Refusal Rate (FRR) metric and demonstrates its application on cyberattack helpfulness. The paper also assesses LLMs' ability to automate vulnerability exploitation across multiple languages, revealing that greater coding ability improves performance but end-to-end exploitation remains challenging. These contributions provide practical, reproducible tools for building safer LLM-based systems and guide the development of layered security guardrails for real-world deployments.

Abstract

Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present CyberSecEval 2, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our results show that conditioning away the risk of attack remains an unsolved problem; for example, all tested models showed between 26% and 41% successful prompt injection tests. We further introduce the safety-utility tradeoff: conditioning an LLM to reject unsafe prompts can cause the LLM to falsely reject benign prompts, which lowers utility. We propose quantifying this tradeoff using False Refusal Rate (FRR). As an illustration, we introduce a novel test set to quantify FRR for cyberattack helpfulness risk. We find that many LLMs can comply with "borderline" benign requests while still rejecting most unsafe requests. Finally, we quantify the utility of LLMs for automating a core cybersecurity task: exploiting software vulnerabilities. This is important because the offensive capabilities of LLMs are of intense interest; we quantify this by creating novel test sets for four representative problems. We find that models with coding capabilities perform better than those without, but that further work is needed for LLMs to become proficient at exploit generation. Our code is open source and can be used to evaluate other LLMs.
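
The abstract names False Refusal Rate (FRR) without giving a formula. As a minimal sketch, assuming FRR is the fraction of benign prompts that a model refuses, the metric could be computed as below. The `PromptResult` type, its fields, and the `false_refusal_rate` function are hypothetical names chosen for illustration; they are not taken from the CyberSecEval 2 code base.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PromptResult:
    """One judged model response (hypothetical record layout)."""
    is_benign: bool  # ground-truth label: the prompt is benign
    refused: bool    # the model declined to answer


def false_refusal_rate(results: Iterable[PromptResult]) -> float:
    """Fraction of benign prompts that were refused (assumed FRR definition)."""
    benign = [r for r in results if r.is_benign]
    if not benign:
        raise ValueError("FRR is undefined without at least one benign prompt")
    refused_benign = sum(1 for r in benign if r.refused)
    return refused_benign / len(benign)


if __name__ == "__main__":
    sample = [
        PromptResult(is_benign=True, refused=False),
        PromptResult(is_benign=True, refused=True),
        PromptResult(is_benign=True, refused=False),
        PromptResult(is_benign=True, refused=False),
        PromptResult(is_benign=False, refused=True),  # unsafe prompt, ignored by FRR
    ]
    print(false_refusal_rate(sample))  # 1 of 4 benign prompts refused -> 0.25
```

Under this assumed definition, only benign prompts enter the denominator, so refusals of unsafe prompts do not raise FRR. That keeps the metric a measure of lost utility, separate from the rate at which unsafe requests are rejected.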

Paper Structure (25 sections, 9 figures, 4 tables)

Introduction
Background and the Safety-Utility Tradeoff
Quantifying the safety-utility tradeoff with False Refusal Rate; illustration with cyberattack helpfulness
Related Work
Detailed Descriptions of New Tests in CyberSecEval 2
Prompt injection evaluations
Testing philosophy
Testing approach
Vulnerability exploitation evaluations
Testing philosophy
Testing approach
Code interpreter abuse evaluation
Testing philosophy
Testing approach
Case Study in Applying CyberSecEval 2
...and 10 more sections

Figures (9)

Figure 1: Summary of LLM performance in non-compliance with requests to help with cyber attacks (left), and average model performance across 10 categories of cyberattack tactics, techniques, and procedures (right).
Figure 2: Tradeoff between LLM performance against 10 categories of cyberattacks and false refusals.
Figure 3: Prompt injection success rate broken down by model and prompt injection variant.
Figure 4: Comparison between the security and logic violations from the prompt injection tests.
Figure 5: Exploitation capability scores broken down by model and test category.
...and 4 more figures
