eyeballvul: a future-proof benchmark for vulnerability detection in the wild

Timothee Chauvin

TL;DR

eyeballvul addresses the challenge of evaluating vulnerability detection in large codebases by introducing a real-world, continuously updated benchmark built from CVEs published for open-source repositories. CVEs are repacked into per-revision ground truth via a minimum hitting set approach, so that long-context LLMs can be evaluated on their ability to propose credible leads; an LLM-based scorer matches each lead against the known CVEs for that revision. Across seven long-context models, the results show substantial room for improvement: the best F1 is around $14.1\%$ and the false-positive rate is high, so current models are far from saturating the benchmark. The work also analyzes vulnerability types and CVSS severities, shows that cost is dominated by false positives, and discusses data quality and alternative scoring paradigms, with implications for defender advantages and for future research on more robust tooling and evaluation frameworks.
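To make the minimum hitting set step concrete, here is a minimal sketch under stated assumptions. It uses a greedy approximation rather than an exact solver, and it is not necessarily how eyeballvul computes its revisions. The input format (a mapping from vulnerability ID to the set of revisions where it is present) and the names `greedy_hitting_set` and `ground_truth` are hypothetical, chosen for illustration only.

```python
def greedy_hitting_set(affected):
    """Pick a small set of revisions such that every vulnerability
    is present in at least one chosen revision (greedy approximation).

    affected: dict mapping a vulnerability ID to the set of revisions
              in which that vulnerability is present (assumed format).
    """
    # Vulnerabilities with no affected revision cannot be hit; skip them.
    remaining = {v: set(revs) for v, revs in affected.items() if revs}
    chosen = []
    while remaining:
        # Count how many not-yet-covered vulnerabilities each revision hits.
        counts = {}
        for revs in remaining.values():
            for r in revs:
                counts[r] = counts.get(r, 0) + 1
        # Take the revision covering the most; break ties by name for determinism.
        best = min(counts, key=lambda r: (-counts[r], r))
        chosen.append(best)
        remaining = {v: revs for v, revs in remaining.items() if best not in revs}
    return chosen


def ground_truth(affected, chosen):
    """For each chosen revision, list the vulnerabilities present at it."""
    return {r: sorted(v for v, revs in affected.items() if r in revs) for r in chosen}


# Toy data with made-up identifiers.
affected = {
    "CVE-A": {"v1", "v2"},
    "CVE-B": {"v2", "v3"},
    "CVE-C": {"v4"},
}
revisions = greedy_hitting_set(affected)
per_revision = ground_truth(affected, revisions)
```

On this toy input, two revisions are enough to cover all three vulnerabilities. A greedy choice is not guaranteed to be minimal in general; an exact formulation would need an integer programming or constraint solver.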

Abstract

Long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of language models at scale, that is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.
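As an illustration of how scorer output for one revision could be turned into precision, recall, and F1, here is a short sketch. It assumes the scorer returns, for each lead, either the ID of the matched known vulnerability or `None`; that output format, the counting convention (distinct matched vulnerabilities count as true positives, unmatched leads as false positives), and the function name `score_revision` are assumptions, not the benchmark's documented interface.

```python
def score_revision(known, matches):
    """Compute (precision, recall, F1) for one revision.

    known:   set of known vulnerability IDs at this revision.
    matches: one entry per lead returned by the model; each entry is the
             ID of the known vulnerability the lead was matched to,
             or None if the lead matched nothing (assumed scorer output).
    """
    matched = {m for m in matches if m is not None and m in known}
    tp = len(matched)                           # distinct known vulnerabilities found
    fp = sum(1 for m in matches if m is None)   # leads matching nothing
    fn = len(known) - tp                        # known vulnerabilities missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with made-up identifiers: four leads, one of which is correct.
known = {"CVE-A", "CVE-B", "CVE-C"}
matches = ["CVE-A", None, None, None]
precision, recall, f1 = score_revision(known, matches)
```

Here one of four leads is correct and one of three known vulnerabilities is found, which shows how a high share of false positives pulls F1 down even when some real vulnerabilities are detected.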

Paper Structure (35 sections, 8 figures, 2 tables)

This paper contains 35 sections, 8 figures, 2 tables.

Introduction
Creating the benchmark
Procedure
Statistics on eyeballvul
Methodology
Processing revisions
LLM scorer
Results
Overall performance: significant room for improvement
Types and severities of vulnerabilities found
Better performance on superficial vulnerabilities.
Slightly better performance on more severe vulnerabilities.
Cost is dominated by false positives
Slight evidence of training data contamination
Smaller context windows don't explain the lower performance of GPT-4
...and 20 more sections

Figures (8)

Figure 1: Distribution of revisions and vulnerabilities by date
Figure 2: Number of revisions by size: many revisions fit within current models' long contexts
Figure 3: Precision, recall, and F1 score of models on the benchmark
Figure 4: Pareto efficiency plot of model performance
Figure 5: Top 10 most frequent CWEs among true positives, and their ranks in MITRE's 2023 CWE Top 25
...and 3 more figures
