eyeballvul: a future-proof benchmark for vulnerability detection in the wild

Timothee Chauvin

TL;DR

Eyeballvul addresses the challenge of evaluating vulnerability detection in large codebases by introducing a real-world, continuously updated benchmark built from CVEs in open-source repositories. The authors group CVEs into per-revision ground truth using a minimum hitting set approach, so that long-context LLMs can be evaluated on their ability to propose credible leads; an LLM-based scorer matches those leads to known CVEs. Across seven long-context models, the results reveal substantial room for improvement: the best F1 score is around $14.1\%$ and false-positive rates are high, indicating that current models are far from saturating the benchmark. The work also analyzes vulnerability types and CVSS severities, shows that cost is dominated by false positives, and discusses data quality and alternative scoring paradigms, highlighting defender advantages and pointing future research toward more robust tooling and evaluation frameworks.
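
The minimum hitting set step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `affected` mapping, and the placeholder IDs (`CVE-A`, `r1`, ...) are all hypothetical, and the exhaustive search shown here is only practical for tiny inputs. It assumes each vulnerability comes with the set of revisions at which it is present, and it picks a smallest set of revisions such that every vulnerability is present in at least one of them.

```python
from itertools import combinations


def minimum_hitting_set(affected):
    """Return a smallest set of revisions that covers every vulnerability.

    `affected` maps a vulnerability ID to the set of revisions at which that
    vulnerability is present (hypothetical input format). The search is
    exhaustive, so it is exponential in the number of revisions.
    """
    if any(not revs for revs in affected.values()):
        raise ValueError("each vulnerability needs at least one affected revision")
    candidates = sorted(set().union(*affected.values()))
    # Try subsets of increasing size; the first one that hits all sets is minimum.
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            chosen_set = set(chosen)
            if all(chosen_set & revs for revs in affected.values()):
                return chosen_set
    return set(candidates)  # not reached: the full candidate set always works


def ground_truth_by_revision(affected, revisions):
    """List, for each selected revision, the vulnerabilities present at it."""
    return {
        rev: sorted(v for v, revs in affected.items() if rev in revs)
        for rev in sorted(revisions)
    }


# Placeholder data, not taken from the benchmark.
affected = {
    "CVE-A": {"r1", "r2"},
    "CVE-B": {"r2", "r3"},
    "CVE-C": {"r3"},
}
selected = minimum_hitting_set(affected)
ground_truth = ground_truth_by_revision(affected, selected)
```

In this toy input no single revision contains all three vulnerabilities, so two revisions are needed. At real scale an exact search like this would be too slow, and an integer-programming or SAT-style solver would be a more plausible choice.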

Abstract

The long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of language models at scale, which is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55 GB in size.
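
To make the comparison step concrete, here is a minimal sketch of how precision, recall, and F1 could be computed for one revision once a scorer has matched each lead to a known vulnerability or to nothing. It is an illustration under stated assumptions, not the benchmark's scoring code: the function name, the input format, and the choice to count duplicate leads toward precision but only distinct vulnerabilities toward recall are all assumptions. The LLM-based matching itself is not shown and is taken as given input.

```python
def score_revision(lead_matches, known_vulns):
    """Compute (precision, recall, f1) for a single revision.

    `lead_matches` has one entry per lead returned by the model: the ID of the
    known vulnerability the scorer matched it to, or None if it matched none.
    `known_vulns` lists the known vulnerabilities at this revision.
    Both formats are hypothetical.
    """
    known = set(known_vulns)
    true_positive_leads = [m for m in lead_matches if m in known]
    # Precision: share of leads that correspond to a known vulnerability.
    precision = len(true_positive_leads) / len(lead_matches) if lead_matches else 0.0
    # Recall: share of known vulnerabilities found by at least one lead.
    recall = len(set(true_positive_leads)) / len(known) if known else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder example: four leads, two of which point at the same vulnerability.
precision, recall, f1 = score_revision(
    ["CVE-A", None, None, "CVE-A"],
    ["CVE-A", "CVE-B"],
)
```

Here two of the four leads are matched, and one of the two known vulnerabilities is found. How per-revision counts are aggregated into benchmark-level figures is left out of this sketch.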

Paper Structure (35 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Distribution of revisions and vulnerabilities by date
  • Figure 2: Number of revisions by size: many revisions fit within current models' long contexts
  • Figure 3: Precision, recall, and F1 score of models on the benchmark
  • Figure 4: Pareto efficiency plot of model performance
  • Figure 5: Top 10 most frequent CWEs among true positives, and their ranks in MITRE's 2023 CWE Top 25
  • ...and 3 more figures