eyeballvul: a future-proof benchmark for vulnerability detection in the wild
Timothee Chauvin
TL;DR
Eyeballvul addresses the challenge of evaluating vulnerability detection in large codebases by introducing a real-world, continuously updated benchmark built from CVEs in open-source repositories. The authors repackage CVEs into per-revision ground truth via a minimum hitting set approach, so that long-context LLMs can be evaluated on their ability to propose credible leads; an LLM-based scorer then matches those leads against the known CVEs. Across seven long-context models, the results leave substantial room for improvement: the best F1 is around $14.1\%$, the false-positive rate is high, and the benchmark is far from saturated. The work also analyzes vulnerability types and CVSS severities, shows that cost is dominated by false positives, and discusses data quality and alternative scoring paradigms. It highlights advantages for defenders and points future research toward more robust tooling and evaluation frameworks.
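To make the minimum hitting set idea concrete, here is a minimal sketch: given, for each CVE, the set of revisions at which it is present, pick a small set of revisions such that every CVE is present in at least one of them. The greedy heuristic below is only an approximation and is not claimed to be the authors' solver; the function name `greedy_hitting_set`, the input layout, and the sample IDs are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_hitting_set(affected: Dict[str, Set[str]]) -> List[str]:
    """Pick revisions so that every CVE is 'hit' by at least one chosen revision.

    `affected` maps a CVE id to the set of revisions at which it is present
    (hypothetical layout, for illustration only). This is the classic greedy
    approximation, not an exact minimum.
    """
    # CVEs with no affected revision cannot be hit, so they are skipped.
    remaining = {cve for cve, revs in affected.items() if revs}
    chosen: List[str] = []
    while remaining:
        # Count how many still-uncovered CVEs each revision would cover.
        counts: Dict[str, int] = {}
        for cve in remaining:
            for rev in affected[cve]:
                counts[rev] = counts.get(rev, 0) + 1
        # Take the revision covering the most CVEs; ties break alphabetically.
        best = max(sorted(counts), key=lambda rev: counts[rev])
        chosen.append(best)
        remaining = {cve for cve in remaining if best not in affected[cve]}
    return chosen


example = {
    "CVE-A": {"r1", "r2"},
    "CVE-B": {"r2", "r3"},
    "CVE-C": {"r4"},
}
print(greedy_hitting_set(example))  # ['r2', 'r4']
```

In this toy input, revision `r2` covers two CVEs at once, so two revisions suffice for three CVEs; each chosen revision then carries the list of CVEs present at it as its ground truth.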
Abstract
Long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of language models at scale, which is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55 GB in size.
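As a rough illustration of how per-revision scoring could be turned into precision, recall, and F1, the sketch below assumes the scorer's output has already been reduced to one entry per lead: either the id of the known vulnerability it matches, or `None`. The function name, the input layout, and the choice to count several leads matching the same vulnerability only once are assumptions made for this example, not details taken from the benchmark.

```python
from typing import Dict, List, Optional, Set


def score_revision(known: Set[str], lead_matches: List[Optional[str]]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for one revision.

    `known` is the set of known vulnerability ids at the revision.
    `lead_matches` holds, for each lead returned by the model, the id of the
    known vulnerability the scorer matched it to, or None (hypothetical layout).
    """
    matched = {m for m in lead_matches if m is not None and m in known}
    tp = len(matched)                                  # known vulnerabilities found
    fp = sum(1 for m in lead_matches if m is None)     # leads matching nothing
    fn = len(known) - tp                               # known vulnerabilities missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One correct lead out of four, with three known vulnerabilities at the revision.
print(score_revision({"CVE-1", "CVE-2", "CVE-3"}, ["CVE-1", None, None, None]))
```

With one true positive, three false positives, and two misses, precision is 0.25, recall is 1/3, and F1 is 2/7, which shows how a high false-positive rate pulls F1 down even when some real vulnerabilities are found.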
