A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback

Ummay Kulsum; Haotian Zhu; Bowen Xu; Marcelo d'Amorim

TL;DR

The paper investigates whether combining reasoning (chain-of-thought prompting) with patch validation feedback improves automated vulnerability repair using large language models. It introduces VRpilot, an LLM-based approach that first reasons about a vulnerability and then iteratively refines a patch using external tool outputs (compilation, functional tests, and security sanitizers). Across C and Java vulnerability datasets, VRpilot outperforms a strong baseline (CodexVR configured with GPT-3.5), achieving higher rates of compilable, plausible, and semantically correct patches, with ablations confirming the synergy between reasoning and feedback. The study emphasizes the importance of detailed context, domain knowledge, and high-quality datasets, and it suggests that LLMs are best used to assist human developers in semi-automatic vulnerability repair, rather than fully replacing them.

Abstract

Recent work in automated program repair (APR) proposes the use of reasoning and patch validation feedback to reduce the semantic gap between LLMs and the code under analysis. The idea has been shown to perform well for general APR, but its effectiveness in other specific contexts remains underexplored. In this work, we assess the impact of reasoning and patch validation feedback on LLMs in the context of vulnerability repair, an important and challenging task in security. To support the evaluation, we present VRpilot, an LLM-based vulnerability repair technique based on reasoning and patch validation feedback. VRpilot (1) uses a chain-of-thought prompt to reason about a vulnerability prior to generating patch candidates and (2) iteratively refines prompts according to the output of external tools (e.g., compiler, code sanitizers, test suite) on previously generated patches. To evaluate performance, we compare VRpilot against the state-of-the-art vulnerability repair techniques for C and Java using public datasets from the literature. Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively. We show, through an ablation study, that reasoning and patch validation feedback are critical. We report several lessons from this study and potential directions for advancing LLM-empowered vulnerability repair.

Paper Structure (23 sections, 3 figures, 6 tables)

Introduction
Methodology
Problem Formulation
Overview
Chain of Thought
Patch Validation Feedback
Implementation
Experimental Setting
Comparison baseline
Research Questions
Metrics
Datasets
Results
RQ1: What is the impact of the base LLM and prompt selection on the state-of-the-art technique for VR?
RQ2: How does VRpilot compare against the baseline?
...and 8 more sections

Figures (3)

Figure 1: An example of correct patches generated by VRpilot. ① VRpilot queries ChatGPT using a prompt with a high-level description of the vulnerability ("Heap Buffer Overflow") and a fragment of the vulnerable code. The prompt also includes the trigger "Let's think step by step" to access ChatGPT's reasoning feature [kojima2022large]. ② In response, ChatGPT explains the problem and suggests a solution. ③ VRpilot queries the model again, combining the initial prompt and the reasoning information, and a patch is produced. ④ However, this patch does not pass the security tests. VRpilot leverages the output of external tools (e.g., compiler, functional and security tests) to circumvent the problem. In this case, VRpilot repeats the previous process after incorporating the error message generated by the address sanitizer (introduced by the compiler). ⑤ ChatGPT updates its reasoning and generates another patch. ⑥ This patch passes all tests.
Figure 2: Overview of VRpilot. The repair process starts by constructing an initial prompt based on vulnerability information and passing it to the Chain-of-Thought block (CoT). The CoT block adds a trigger sentence and queries the LLM to generate reasoning for the task. The final CoT prompt combines the initial CoT prompt, the reasoning, and another trigger sentence. The LLM generates the repair patch using this final CoT prompt, and then the patch is compiled and tested. If the patch passes, it is considered a plausible patch. If not, VRpilot refines the incorrect patch by creating a feedback prompt with the error messages and vulnerability information and repeats the same steps.
Figure 3: A Complex Patch for EF15 out-of-bounds vulnerability
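
The repair loop described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `repair`, the `Validation` type, the `llm` and `validate` callables, the `max_rounds` limit, and the wording of `PATCH_TRIGGER` and of the feedback prompt are all assumptions made for the example. Only the reasoning trigger "Let's think step by step" is taken from the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Trigger quoted in the Figure 1 caption.
REASONING_TRIGGER = "Let's think step by step."
# Hypothetical wording for the second trigger sentence; the source does not give it.
PATCH_TRIGGER = "Now write the patched code."


@dataclass
class Validation:
    """Outcome of compiling and testing one candidate patch (hypothetical type)."""
    passed: bool
    error_message: str = ""


def repair(
    vuln_description: str,
    vulnerable_code: str,
    llm: Callable[[str], str],
    validate: Callable[[str], Validation],
    max_rounds: int = 3,  # assumed iteration budget
) -> Optional[str]:
    """Return a plausible patch, or None if no candidate passes validation."""
    initial_prompt = f"Vulnerability: {vuln_description}\n\nCode:\n{vulnerable_code}"
    prompt = initial_prompt
    for _ in range(max_rounds):
        # Chain-of-thought step: ask the model to reason before patching.
        cot_prompt = f"{prompt}\n\n{REASONING_TRIGGER}"
        reasoning = llm(cot_prompt)
        # Final CoT prompt: initial CoT prompt + reasoning + second trigger.
        final_prompt = f"{cot_prompt}\n{reasoning}\n\n{PATCH_TRIGGER}"
        patch = llm(final_prompt)
        # Patch validation: compiler, functional tests, security tests.
        result = validate(patch)
        if result.passed:
            return patch
        # Feedback prompt: vulnerability information plus the tool's error message.
        prompt = (
            f"{initial_prompt}\n\nPrevious patch:\n{patch}\n\n"
            f"It failed validation with:\n{result.error_message}"
        )
    return None
```

In this sketch, `llm` would wrap a chat-model call and `validate` would build the patched program and run the test suite and sanitizers, returning the first error message it sees. Passing both in as plain callables keeps the loop independent of any particular model or toolchain.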
