Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs

Maria Camporese, Fabio Massacci

TL;DR

The paper investigates whether high performance of LLMs in automated vulnerability repair rests on memorization from training data or genuine generalization. By deliberately perturbing vulnerability localization and incorporating a second-opinion reviewer, the study differentiates memorization-driven patches from those produced by authentic problem understanding. Using Vul4J and VJBench-trans real-world Java vulnerabilities, the authors evaluate patch generation, testing, and manual validation, employing equivalence testing to assess prompt and localization effects. Findings aim to clarify the extent to which LLMs truly repair vulnerabilities vs. reproduce known fixes, with implications for reproducibility and methodology in AVR research and deployment.

Abstract

Background: Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques, extending their success beyond code generation and fault detection. Hypothesis: These gains may be driven by hidden factors -- "invisible hands" such as training-data leakage or perfect fault localization -- that let an LLM reproduce human-authored fixes for the same code. Objective: We replicate prior AVR studies under controlled conditions by deliberately adding errors to the reported vulnerability location in the prompt. If LLMs merely regurgitate memorized fixes, both small and large localization errors should yield the same number of correct patches, because any offset should divert the model from the original fix. Method: Our pipeline repairs vulnerabilities from the Vul4J and VJTrans benchmarks after shifting the fault location by n lines from the ground truth. A first LLM generates a patch, a second LLM reviews it, and we validate the result with regression and proof-of-vulnerability tests. Finally, we manually audit a sample of patches and estimate the error rate with the Agresti-Coull-Wilson method.
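The abstract names an "Agresti-Coull-Wilson method" for estimating the error rate of the manually audited sample but gives no formula. The sketch below assumes this refers to the standard Agresti-Coull adjusted interval, which is centred on the same point as the Wilson score interval. The function name, the 95% default, and the sample counts in the usage note are illustrative assumptions, not values from the paper.

```python
import math


def agresti_coull_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for an error rate (Agresti-Coull).

    errors: number of audited patches judged wrong (hypothetical input)
    n:      number of audited patches
    z:      normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    # Add z^2 pseudo-observations, half of them counted as errors.
    n_adj = n + z ** 2
    p_adj = (errors + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1] because the raw interval can overshoot near the boundaries.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)
```

For example, `agresti_coull_interval(5, 50)` gives an interval of roughly 4% to 22% for a made-up audit in which 5 of 50 sampled patches were found to be wrong.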

Paper Structure

This paper contains 26 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: While previous studies tested LLMs for repair starting from the exact vulnerability localization, we investigate the impact of localization errors on different approaches. For example, in the approach proposed by Wu et al. [wu2023effective], the LLM is prompted to replace exactly the vulnerable lines, so even small displacements would prove disruptive. Here we prompted GPT-3.5 (used by Kulsum et al. [kulsum2024case]) to fix the vulnerable function, but the prompt gave wrong information about the vulnerable line. The figure shows two negative offsets and one positive offset, together with the corresponding responses. When the offset is 4 lines above, pointing at nothing more than a curly bracket (line 2), the model still generates the developer fix. How is this likely to happen in the absence of memorization?
  • Figure 2: Execution plan.
  • Figure 3: Helmert contrast for $RQ1$
  • Figure 4: Helmert contrast for $RQ1_{ERR}$
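The localization perturbation described in the abstract and in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name is hypothetical, and clamping the shifted line to the bounds of the function is an assumption about how out-of-range offsets might be handled.

```python
def shift_location(true_line: int, offset: int, n_lines: int) -> int:
    """Return the 1-based line number reported to the model.

    true_line: ground-truth vulnerable line inside the function
    offset:    signed displacement in lines (negative means above)
    n_lines:   total number of lines in the function
    """
    if not 1 <= true_line <= n_lines:
        raise ValueError("true_line must lie inside the function")
    # Assumption: an offset that leaves the function is clamped to its
    # first or last line instead of being rejected.
    return min(max(true_line + offset, 1), n_lines)
```

In the Figure 1 example, an offset of 4 lines above lands on line 2, which implies a ground-truth line of 6: `shift_location(6, -4, 20)` returns `2` (the function length of 20 is made up for the illustration).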