A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback
Ummay Kulsum, Haotian Zhu, Bowen Xu, Marcelo d'Amorim
TL;DR
The paper investigates whether combining reasoning (chain-of-thought prompting) with patch validation feedback improves automated vulnerability repair using large language models. It introduces VRpilot, an LLM-based approach that first reasons about a vulnerability and then iteratively refines a patch using external tool outputs (compilation, functional tests, and security sanitizers). Across C and Java vulnerability datasets, VRpilot outperforms a strong baseline (CodexVR configured with GPT-3.5), achieving higher rates of compilable, plausible, and semantically correct patches, with ablations confirming the synergy between reasoning and feedback. The study emphasizes the importance of detailed context, domain knowledge, and high-quality datasets, and it suggests that LLMs are best used to assist human developers in semi-automatic vulnerability repair, rather than fully replacing them.
Abstract
Recent work in automated program repair (APR) proposes the use of reasoning and patch validation feedback to reduce the semantic gap between LLMs and the code under analysis. The idea has been shown to perform well for general APR, but its effectiveness in more specific contexts remains underexplored. In this work, we assess the impact of providing reasoning and patch validation feedback to LLMs in the context of vulnerability repair, an important and challenging task in security. To support the evaluation, we present VRpilot, an LLM-based vulnerability repair technique based on reasoning and patch validation feedback. VRpilot (1) uses a chain-of-thought prompt to reason about a vulnerability prior to generating patch candidates and (2) iteratively refines prompts according to the output of external tools (e.g., compiler, code sanitizers, test suite) on previously generated patches. To evaluate performance, we compare VRpilot against state-of-the-art vulnerability repair techniques for C and Java using public datasets from the literature. Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively. We show, through an ablation study, that reasoning and patch validation feedback are critical. We report several lessons from this study and potential directions for advancing LLM-empowered vulnerability repair.
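The two-step loop described in the abstract (reason first, then refine using tool feedback) can be illustrated with a minimal sketch. This is not VRpilot's implementation: the names `repair_loop`, `ValidationResult`, `generate`, and `validate`, the prompt wording, and the iteration budget are all hypothetical, chosen only to show the control flow. The `generate` callable stands in for an LLM call, and `validate` stands in for whatever external tools (compiler, test suite, sanitizers) are run on a candidate patch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidationResult:
    """Outcome of running external tools on a candidate patch (hypothetical type)."""
    passed: bool
    feedback: str  # e.g. a compiler error, a failing test, or a sanitizer report


def repair_loop(
    vulnerable_code: str,
    generate: Callable[[str], str],               # assumed stand-in for an LLM call
    validate: Callable[[str], ValidationResult],  # assumed stand-in for the tools
    max_iterations: int = 3,                      # assumed budget, not from the paper
) -> Optional[str]:
    """Return the first candidate that passes validation, or None if none does."""
    # Step 1: a chain-of-thought style prompt that asks for reasoning
    # about the vulnerability before a patch is proposed.
    prompt = (
        "Reason step by step about the vulnerability in the code below, "
        "then propose a patched version.\n\n" + vulnerable_code
    )
    for _ in range(max_iterations):
        candidate = generate(prompt)
        result = validate(candidate)
        if result.passed:
            return candidate
        # Step 2: fold the tool output on the failed candidate back into
        # the prompt so the next attempt can address it.
        prompt += (
            "\n\nThe previous patch was:\n" + candidate
            + "\n\nValidation failed with:\n" + result.feedback
            + "\n\nRevise the patch to address this feedback."
        )
    return None
```

In this sketch the prompt grows with each failed attempt, so later candidates see every earlier failure. A real system would also need to bound prompt length and decide which tool output is most useful to include.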
