Persistent Human Feedback, LLMs, and Static Analyzers for Secure Code Generation and Vulnerability Detection

Ehsan Firouzi, Mohammad Ghafari

TL;DR

This paper questions whether static analysis tools can serve as the sole evaluators of LLM-generated secure code and vulnerability detection. The authors audit 1,080 GPT-4o-generated samples against ground-truth CWE labels and find that Semgrep and CodeQL, even where their aggregate secure rates resemble the human-validated rate, disagree with the ground truth on a substantial share of individual samples, which underscores the need for expert feedback. Building on this insight, they propose a conceptual framework that persistently stores human feedback in a dynamic retrieval-augmented generation (RAG) pipeline. The framework combines a Prompt Agent, a Security Agent, and a Human-in-the-Loop (HIL) Agent within a Dual-Source RAG so that past feedback can improve future generation and detection outcomes. It also introduces dynamic trust weighting, success updates based on an exponential moving average (EMA), and safeguards against feedback attacks, with the aim of making LLM-driven secure coding more reliable and auditable in practice. Overall, the work argues for persistent human-in-the-loop feedback as a core component of LLM-based security workflows.

Abstract

Existing literature heavily relies on static analysis tools to evaluate LLMs for secure code generation and vulnerability detection. We reviewed 1,080 LLM-generated code samples, built a human-validated ground truth, and compared the outputs of two widely used static security tools, CodeQL and Semgrep, against this corpus. While 61% of the samples were genuinely secure, Semgrep and CodeQL classified 60% and 80% as secure, respectively. Despite the apparent agreement in aggregate statistics, per-sample analysis reveals substantial discrepancies: only 65% of Semgrep's and 61% of CodeQL's reports correctly matched the ground truth. These results question the reliability of static analysis tools as sole evaluators of code security and underscore the need for expert feedback. Building on this insight, we propose a conceptual framework that persistently stores human feedback in a dynamic retrieval-augmented generation pipeline, enabling LLMs to reuse past feedback for secure code generation and vulnerability detection.
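
The gap between aggregate and per-sample agreement can be illustrated with a minimal sketch. The data below is a toy example invented for illustration, not drawn from the paper's corpus, and the function names `secure_rate` and `per_sample_agreement` are hypothetical. It shows how a tool can report the same overall secure rate as the ground truth while matching it on only a few individual samples.

```python
from typing import Sequence


def secure_rate(labels: Sequence[bool]) -> float:
    """Fraction of samples labeled secure (True)."""
    return sum(labels) / len(labels)


def per_sample_agreement(tool: Sequence[bool], truth: Sequence[bool]) -> float:
    """Fraction of samples where the tool verdict equals the ground truth."""
    if len(tool) != len(truth):
        raise ValueError("tool and truth must label the same samples")
    return sum(t == g for t, g in zip(tool, truth)) / len(truth)


# Toy verdicts (assumption: True means "secure"); NOT data from the paper.
truth = [True, True, True, False, False]
tool = [True, False, False, True, True]

print(f"ground-truth secure rate: {secure_rate(truth):.2f}")
print(f"tool secure rate:         {secure_rate(tool):.2f}")
print(f"per-sample agreement:     {per_sample_agreement(tool, truth):.2f}")
```

In this toy case both secure rates are 0.60, yet the tool agrees with the ground truth on only one of five samples. This is why matching aggregate statistics alone says little about a tool's reliability as an evaluator.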

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Distribution of targeted CWEs in the ground-truth dataset
  • Figure 2: Tools agreement (intersection) and coverage (union) across CWEs based on ground truth
  • Figure 3: The proposed framework