Favia: Forensic Agent for Vulnerability-fix Identification and Analysis

André Storhaug, Jiamou Sun, Jingyue Li

TL;DR

This work tackles the challenge of identifying vulnerability-fix commits for CVEs in large software repositories. It introduces Favia, a hybrid framework that first ranks candidate commits efficiently and then applies an iterative, evidence-driven ReAct-based agent inside a pre-change code environment to verify true patches. On the large-scale CVEVC dataset, Favia consistently outperforms traditional and state-of-the-art LLM baselines under realistic candidate sets, achieving superior precision–recall trade-offs and higher F1-scores, while revealing failure modes dominated by superficial associations and CVE misinterpretation. The study also shows that evaluations based on random commit sampling inflate performance, underscoring the need for realistic benchmarks and highlighting practical implications for academia and industry in secure software maintenance and vulnerability management.

Abstract

Identifying vulnerability-fixing commits corresponding to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, as large repositories contain millions of commits, of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent large language model (LLM)-based methods, often suffer from poor precision-recall trade-offs. We find that these approaches are frequently evaluated on randomly sampled commits, a setup that substantially underestimates real-world difficulty, where candidate commits are already security-relevant and highly similar. We propose Favia, a forensic, agent-based framework for vulnerability-fix identification that combines scalable candidate ranking with deep and iterative semantic reasoning. Favia first employs an efficient ranking stage to narrow the search space of commits. Each remaining candidate commit is then rigorously evaluated using a ReAct-based LLM agent. Given a pre-commit repository as its environment, along with specialized tools, the agent localizes vulnerable components, navigates the codebase, and establishes causal alignment between code changes and vulnerability root causes. This evidence-driven process enables robust identification of indirect, multi-file, and non-trivial fixes that elude single-pass or similarity-based methods. We evaluate Favia on CVEVC, a large-scale dataset we constructed that comprises over 8 million commits from 3,708 real-world repositories, and show that it consistently outperforms state-of-the-art traditional and LLM-based baselines under realistic candidate selection, achieving the strongest precision-recall trade-offs and highest F1-scores.
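
The two-stage flow described in the abstract (rank first, then verify each top-ranked candidate) can be illustrated with a minimal sketch. This is not Favia's implementation: the `Commit` type, the token-overlap scorer in `rank_commits`, and the `verify` callback are all hypothetical stand-ins. In the paper, the first stage is a machine learning ranker and the second stage is a ReAct-based LLM agent operating on the pre-commit repository; here both are replaced by toy functions so that the control flow is runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Commit:
    # Hypothetical minimal commit record; a real system would also carry the diff.
    sha: str
    message: str


def rank_commits(cve_description: str, commits: List[Commit], top_n: int) -> List[Commit]:
    """Stage 1 stand-in: rank commits by token overlap with the CVE text.

    This toy scorer is an assumption made for illustration only; it is not
    the ranking model used in the paper.
    """
    cve_tokens = set(cve_description.lower().split())

    def score(commit: Commit) -> int:
        return len(cve_tokens & set(commit.message.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return sorted(commits, key=score, reverse=True)[:top_n]


def identify_fixes(
    cve_description: str,
    commits: List[Commit],
    verify: Callable[[str, Commit], bool],
    top_n: int = 10,
) -> List[Commit]:
    """Stage 2 stand-in: keep only the top-n candidates the verifier accepts.

    `verify` represents the expensive per-commit check (an LLM agent in the
    paper); it is passed in as a plain callback here.
    """
    candidates = rank_commits(cve_description, commits, top_n)
    return [c for c in candidates if verify(cve_description, c)]


if __name__ == "__main__":
    history = [
        Commit("a1", "update docs"),
        Commit("b2", "fix buffer overflow in parser"),
        Commit("c3", "refactor parser tests"),
    ]
    cve = "Buffer overflow in the parser allows remote attackers to crash"
    # Placeholder verifier: accepts any commit whose message mentions a fix.
    fixes = identify_fixes(cve, history, lambda _cve, c: "fix" in c.message, top_n=2)
    print([c.sha for c in fixes])
```

The point of the split is cost: the cheap ranker runs over the whole history, while the expensive verifier only sees the `top_n` survivors.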

Paper Structure (59 sections, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Top-n ranking using machine learning classifier.
  • Figure 2: Agent classification for each of the top-n ranked commits.
  • Figure 3: PatchFinder effectiveness on test split of CVEVC dataset.
  • Figure 4: Reasoning output of CommitShield's analysis of commit 705a427 against CVE-2014-9625.
  • Figure 5: Reasoning output of LLM4VFD's analysis of commit 705a427 against CVE-2014-9625.
  • ...and 7 more figures