Detect Repair Verify for Securing LLM Generated Code: A Multi-Language Empirical Study

Cheng Cheng

TL;DR

This work studies a Detect-Repair-Verify (DRV) workflow for securing LLM-generated, project-level artifacts and addresses four gaps: L1, the lack of project-level benchmarks with executable functional and security tests; L2, limited evidence on pipeline-level effectiveness, since prior work studies detection or repair in isolation; L3, unclear reliability of detection reports as repair guidance; and L4, uncertain repair trustworthiness and side effects under verification.

Abstract

Large language models are increasingly used to produce runnable software. In practice, security is often addressed through a Detect-Repair-Verify (DRV) loop that detects issues, applies fixes, and verifies the result. This work studies such a workflow for project-level artifacts and addresses four gaps: L1, the lack of project-level benchmarks with executable functional and security tests; L2, limited evidence on pipeline-level effectiveness, since prior work studies detection or repair in isolation; L3, unclear reliability of detection reports as repair guidance; and L4, uncertain repair trustworthiness and side effects under verification. A new benchmark dataset (https://github.com/Hahappyppy2024/EmpricalVDR) is introduced, consisting of runnable web-application projects paired with functional tests and targeted security tests, and supporting three prompt granularities: project level, requirement level, and function level. The evaluation compares generation-only, single-pass DRV, and bounded iterative DRV variants under comparable budget constraints. Outcomes are measured by secure-and-correct yield using test-grounded verification, and intermediate artifacts are analyzed to assess report actionability and post-repair failure modes such as regressions, semantic drift, and newly introduced security issues.
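The bounded DRV loop and the secure-and-correct yield described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Verdict`, `run_drv`, `secure_correct_yield`, and the `detect`, `repair`, and `verify` callables are all hypothetical, and the stopping rule (stop as soon as both test suites pass or the iteration budget is spent) is an assumed reading of "bounded iterative DRV".

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical verification outcome for one artifact."""
    functional_pass: bool  # all functional tests pass
    security_pass: bool    # all targeted security tests pass

    @property
    def secure_and_correct(self) -> bool:
        return self.functional_pass and self.security_pass


def run_drv(
    artifact: Any,
    detect: Callable[[Any], Any],
    repair: Callable[[Any, Any, Verdict], Any],
    verify: Callable[[Any], Verdict],
    max_iters: int = 3,
) -> Tuple[Any, Verdict, int]:
    """Run a bounded detect-repair-verify loop.

    Assumed mapping to the compared variants: max_iters=0 is
    generation-only, max_iters=1 is single-pass DRV, and larger
    values give bounded iterative DRV.
    """
    verdict = verify(artifact)
    iterations = 0
    while not verdict.secure_and_correct and iterations < max_iters:
        report = detect(artifact)                     # detection report
        artifact = repair(artifact, report, verdict)  # repair guided by report and test feedback
        verdict = verify(artifact)                    # test-grounded verification
        iterations += 1
    return artifact, verdict, iterations


def secure_correct_yield(verdicts: List[Verdict]) -> float:
    """Fraction of artifacts that pass both functional and security tests."""
    if not verdicts:
        return 0.0
    return sum(v.secure_and_correct for v in verdicts) / len(verdicts)
```

In this sketch the verifier's outcome is passed back to `repair`, so test results can serve as feedback for the next iteration; whether the study feeds detection reports, test results, or both to the repair step is not stated in the abstract.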

Paper Structure (60 sections, 1 figure, 14 tables)

Introduction
Related Work
Secure Code Generation by LLMs
Vulnerability Detection
Traditional static analysis.
Dynamic analysis.
Learning-based detection.
Vulnerability Localization
Vulnerability Repair
Pattern- and learning-based vulnerability repair.
LLM-based Vulnerability Repair
Early exploration of LLM capabilities for repair.
Empirical understanding of LLM-based repair.
Prompt-driven and iterative LLM repair.
...and 45 more sections

Figures (1)

Figure 1: Overview of the experimental pipeline. Starting from the EduCollab benchmark (PHP/JS/Python), artifacts are constructed under three prompt granularities (project-, requirement-, and function-level) and processed through a detect-repair-verify workflow. Verification is performed via the test suite, whose outcomes provide grounded feedback to drive bounded iterations and enable measurement of RQ1 through RQ3.
