
Detect–Repair–Verify for LLM-Generated Code: A Multi-Language, Multi-Granularity Empirical Study

Cheng Cheng

Abstract

Large language models can generate runnable software artifacts, but their security remains difficult to evaluate end to end. This study examines that problem through a Detect–Repair–Verify (DRV) workflow, in which vulnerabilities are detected, repaired, and then rechecked with security and functional tests. It addresses four gaps in current evidence: the lack of test-grounded benchmarks for LLM-generated artifacts, limited evidence on pipeline-level effectiveness, unclear reliability of detection reports as repair guidance, and uncertain repair trustworthiness under verification. To support this study, EduCollab is constructed as a multi-language, multi-granularity benchmark of runnable LLM-generated web applications in PHP, JavaScript, and Python. Each artifact is paired with executable functional and exploit test suites, and the benchmark spans project-, requirement-, and file-level settings. On this benchmark, the study compares unrepaired baselines, single-pass detect–repair, and bounded iterative DRV under comparable budget constraints. Outcomes are measured by secure-and-correct yield, and intermediate artifacts and iteration traces are analyzed to assess report actionability and repair failure modes. The results show that bounded iterative DRV can improve secure-and-correct yield over single-pass repair, but the gains are uneven at the project level and become clearer at narrower repair scopes. Detection reports are often useful for downstream repair, but their reliability is inconsistent. Repair trustworthiness also depends strongly on repair scope. These findings highlight the need for test-grounded, end-to-end evaluation of LLM-based vulnerability management workflows.

Paper Structure (62 sections, 4 figures, 19 tables)

Figures (4)

  • Figure 1: Two motivating scenarios for securing LLM-generated artifacts through detect–repair–verify. In one scenario, vulnerabilities remain latent and must first be identified before repair can begin. In the other, vulnerabilities are already known, so repair starts from explicit guidance. Although the entry points differ, both scenarios are handled within the same detect–repair–verify framework and are evaluated against the same outcome: whether the patched artifact preserves intended functionality while mitigating the targeted vulnerability.
  • Figure 2: Overview of the experimental pipeline. Starting from the EduCollab benchmark (PHP/JS/Python), the study evaluates artifacts under three prompt granularities: project-level, requirement-level, and file-level. These artifacts are processed through a detect–repair–verify workflow, where repair generates LLM-based patches and verification reruns the executable test suites. The resulting feedback supports bounded iteration and enables the evaluation of RQ1–RQ3.
  • Figure 3: Detect–repair–verify workflow for artifacts with latent vulnerabilities. Starting from a runnable baseline artifact, the workflow uses project-level or requirement-level prompts to detect candidate vulnerabilities, generate patches, and verify patched artifacts by rerunning both functional and exploit tests under a bounded iteration budget.
  • Figure 4: Detect–repair–verify workflow for artifacts with pre-identified vulnerabilities. Starting from a runnable baseline artifact paired with a specific exploit target, the workflow uses file-level prompts to localize and repair one vulnerability at a time, and verifies each patched artifact by rerunning both functional and exploit tests under a bounded iteration budget.
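
The bounded iteration described in the captions above can be illustrated with a minimal sketch. It is not the study's implementation: the names `Verdict`, `bounded_drv`, and `secure_and_correct_yield`, the callback signatures, and the default budget of three iterations are all assumptions made for illustration. The sketch also assumes that an artifact counts as secure and correct only when its functional tests pass and its exploit tests no longer succeed, and that yield is the fraction of artifacts meeting that condition.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")  # artifact type (hypothetical; e.g. a path to a project)
R = TypeVar("R")  # detection report type (hypothetical)


@dataclass(frozen=True)
class Verdict:
    """Outcome of one verification run (assumed shape, not from the paper)."""
    functional_pass: bool  # every functional test passed
    exploit_blocked: bool  # no exploit test succeeded

    @property
    def secure_and_correct(self) -> bool:
        return self.functional_pass and self.exploit_blocked


def bounded_drv(
    artifact: A,
    detect: Callable[[A, Verdict], R],
    repair: Callable[[A, R], A],
    verify: Callable[[A], Verdict],
    max_iters: int = 3,  # assumed budget; the paper's value is not given here
) -> Tuple[A, List[Verdict]]:
    """Repeat detect -> repair -> verify until verified or out of budget."""
    verdict = verify(artifact)  # baseline check before any repair
    trace = [verdict]
    for _ in range(max_iters):
        if verdict.secure_and_correct:
            break
        report = detect(artifact, verdict)   # previous verdict acts as feedback
        artifact = repair(artifact, report)  # candidate patch
        verdict = verify(artifact)           # rerun functional + exploit tests
        trace.append(verdict)
    return artifact, trace


def secure_and_correct_yield(verdicts: Sequence[Verdict]) -> float:
    """Fraction of final verdicts that are both secure and correct."""
    if not verdicts:
        return 0.0
    return sum(v.secure_and_correct for v in verdicts) / len(verdicts)


# Toy stand-ins: the "artifact" is just the set of vulnerabilities it contains.
def toy_verify(vulns: frozenset) -> Verdict:
    return Verdict(functional_pass=True, exploit_blocked=not vulns)


def toy_detect(vulns: frozenset, verdict: Verdict) -> list:
    return sorted(vulns)[:1]  # report one vulnerability per round


def toy_repair(vulns: frozenset, report: list) -> frozenset:
    return vulns - set(report)


if __name__ == "__main__":
    start = frozenset({"sqli", "xss"})
    patched, trace = bounded_drv(start, toy_detect, toy_repair, toy_verify)
    print(len(trace) - 1, "repair rounds;", "verified:", trace[-1].secure_and_correct)
```

In this toy setting, single-pass detect–repair corresponds to `max_iters=1`, which leaves one of the two vulnerabilities in place, while a larger budget lets the loop continue until verification succeeds. Keeping the full `trace` rather than only the final verdict mirrors the study's use of iteration traces to analyze repair failure modes.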