Favia: Forensic Agent for Vulnerability-fix Identification and Analysis
André Storhaug, Jiamou Sun, Jingyue Li
TL;DR
This work tackles the challenge of identifying vulnerability-fix commits for CVEs in large software repositories. It introduces Favia, a hybrid framework that first ranks candidate commits efficiently and then runs an iterative, evidence-driven ReAct-based agent inside a pre-change code environment to verify which candidates are true patches. On the large-scale CVEVC dataset, Favia consistently outperforms traditional and state-of-the-art LLM baselines under realistic candidate sets, achieving better precision–recall trade-offs and higher F1-scores. Its remaining failures are dominated by superficial associations and CVE misinterpretation. The study also shows that evaluations based on random commit sampling inflate performance, which underscores the need for realistic benchmarks and has practical implications for secure software maintenance and vulnerability management in both academia and industry.
Abstract
Identifying vulnerability-fixing commits corresponding to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, as large repositories contain millions of commits of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent large language model (LLM)-based methods, often suffer from poor precision-recall trade-offs. Moreover, these approaches are frequently evaluated on randomly sampled commits, which we find substantially underestimates real-world difficulty: in practice, candidate commits are already security-relevant and highly similar to one another. We propose Favia, a forensic, agent-based framework for vulnerability-fix identification that combines scalable candidate ranking with deep, iterative semantic reasoning. Favia first employs an efficient ranking stage to narrow the search space of commits. Each remaining candidate commit is then rigorously evaluated by a ReAct-based LLM agent. Given the pre-commit repository as its environment, along with specialized tools, the agent localizes vulnerable components, navigates the codebase, and establishes causal alignment between code changes and vulnerability root causes. This evidence-driven process enables robust identification of indirect, multi-file, and non-trivial fixes that elude single-pass or similarity-based methods. We evaluate Favia on CVEVC, a large-scale dataset we constructed that comprises over 8 million commits from 3,708 real-world repositories, and show that it consistently outperforms state-of-the-art traditional and LLM-based baselines under realistic candidate selection, achieving the strongest precision-recall trade-offs and highest F1-scores.
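To make the two-stage structure concrete, the following is a minimal illustrative sketch, not Favia's actual implementation. It assumes a simple token-overlap (Jaccard) score as a stand-in for the ranking stage and a caller-supplied `verify` callable as a stand-in for the agent-based verification stage. The names `Commit`, `rank_candidates`, and `identify_fixes` are hypothetical and do not come from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Commit:
    """Hypothetical minimal commit record used only for this sketch."""
    sha: str
    message: str
    diff: str


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def rank_candidates(cve_description: str, commits: List[Commit], top_k: int) -> List[Commit]:
    """Stage 1 (assumed stand-in): rank commits by Jaccard overlap with the CVE text."""
    cve_tokens = tokenize(cve_description)

    def score(commit: Commit) -> float:
        tokens = tokenize(commit.message + " " + commit.diff)
        union = cve_tokens | tokens
        return len(cve_tokens & tokens) / len(union) if union else 0.0

    return sorted(commits, key=score, reverse=True)[:top_k]


def identify_fixes(
    cve_description: str,
    commits: List[Commit],
    verify: Callable[[str, Commit], bool],
    top_k: int = 10,
) -> List[Commit]:
    """Stage 2: keep only the top-ranked candidates that the verifier accepts.

    In the paper this role is played by a ReAct-based LLM agent working in the
    pre-commit repository; here `verify` is an arbitrary placeholder callable.
    """
    candidates = rank_candidates(cve_description, commits, top_k)
    return [commit for commit in candidates if verify(cve_description, commit)]
```

Separating the cheap ranking step from the expensive verification step is what keeps the overall cost bounded: the verifier only ever sees `top_k` commits per CVE, regardless of repository size.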
