Rethinking the Evaluation of Secure Code Generation

Shih-Chieh Dai, Jun Xu, Guanhong Tao

TL;DR

The paper challenges the standard practice of evaluating secure code generation in isolation by proposing a joint security-functionality framework and employing multiple vulnerability detectors across large benchmarks. It introduces the SAFE metric, enabling nuanced assessment of code security alongside unit-test-based functionality, and extends SecCodePLT+ with unit tests to provide a robust benchmark. Across four techniques and multiple LLMs, the study finds that security gains are often achieved at the cost of functionality, with vulnerability detectors showing inconsistent results and true improvements being limited. The work advocates for multi-tool security evaluation and joint metrics as essential for advancing secure code generation in practice, and it provides a public benchmark and classification of failure modes to guide future research.

Abstract

Large language models (LLMs) are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current evaluation schemes leave several concerns unaddressed. Specifically, most existing studies evaluate security and functional correctness separately, using different datasets. That is, they assess vulnerabilities using security-related code datasets while validating functionality with general code datasets. In addition, prior research primarily relies on a single static analyzer, CodeQL, to detect vulnerabilities in generated code, which limits the scope of security evaluation. In this work, we conduct a comprehensive study to systematically assess the improvements introduced by four state-of-the-art secure code generation techniques. Specifically, we apply both security inspection and functionality validation to the same generated code and evaluate these two aspects together. We also employ three popular static analyzers and two LLMs to identify potential vulnerabilities in the generated code. Our study reveals that existing techniques often compromise the functionality of generated code to enhance security. Their overall performance remains limited when evaluating security and functionality together. In fact, many techniques even degrade the performance of the base LLM by more than 50%. Our further inspection reveals that these techniques often either remove vulnerable lines of code entirely or generate "garbage code" that is unrelated to the intended task. Moreover, the commonly used static analyzer CodeQL fails to detect several vulnerabilities, further obscuring the actual security improvements achieved by existing techniques.
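To make the idea of evaluating security and functionality on the same generated code concrete, the following is a minimal sketch, not the paper's implementation and not its SAFE metric. It assumes a hypothetical `Sample` record holding unit-test outcomes and per-detector findings, and it counts a sample as jointly successful only if it passes every unit test and no detector reports a finding. The detector names and the toy data are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """One generated program with its evaluation outcomes (hypothetical record)."""
    tests_passed: int
    tests_total: int
    # Findings keyed by detector name, e.g. {"CodeQL": [], "Bandit": ["CWE-78"]}
    findings: dict = field(default_factory=dict)

    @property
    def functional(self) -> bool:
        # Task-level correctness: every unit test must pass.
        return self.tests_total > 0 and self.tests_passed == self.tests_total

    def secure(self, detectors=None) -> bool:
        # Secure only if none of the selected detectors reports a finding.
        names = detectors if detectors is not None else list(self.findings)
        return all(not self.findings.get(name, []) for name in names)


def rate(samples, predicate) -> float:
    """Fraction of samples satisfying the predicate."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(1 for s in samples if predicate(s)) / len(samples)


# Toy data (assumed, for illustration only).
samples = [
    Sample(3, 3, {"CodeQL": [], "Bandit": ["CWE-78"]}),          # functional, missed by CodeQL
    Sample(0, 3, {"CodeQL": [], "Bandit": []}),                  # "secure" but not functional
    Sample(3, 3, {"CodeQL": [], "Bandit": []}),                  # secure and functional
    Sample(3, 3, {"CodeQL": ["CWE-22"], "Bandit": ["CWE-22"]}),  # functional, insecure
]

secure_codeql_only = rate(samples, lambda s: s.secure(["CodeQL"]))
secure_all_tools = rate(samples, lambda s: s.secure())
functional_rate = rate(samples, lambda s: s.functional)
joint_rate = rate(samples, lambda s: s.functional and s.secure())

print(secure_codeql_only, secure_all_tools, functional_rate, joint_rate)
```

On this toy data, the single-analyzer security rate (0.75) and the functionality rate (0.75) both look healthy in isolation, while the joint rate is 0.25. This is the kind of gap the abstract attributes to separate evaluation and to reliance on one detector.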

Paper Structure

This paper contains 16 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: An example where CodeQL overlooks a vulnerability in the generated code. Line 15 contains a CWE-78 vulnerability that CodeQL fails to identify.
  • Figure 2: Generated code by DeepSeek-Coder-V2-Lite before (left) and after (right) applying CodeGuard+. The code on the left is functional but insecure. CWE-22 vulnerabilities exist at lines 6, 10, and 16. The code on the right is secure but not functional.
  • Figure 3: The results of Secure@1 from each static analyzer and Pass@1 on BigCodeBench. The results are separated by a dashed line. CL, BR, and BT represent CodeQL, Bearer, and Bandit, respectively. "Case" and "Task" indicate the Pass@1 scores calculated at the test case level (considering percentage of passed unit tests) and the task level (passing all unit tests).
  • Figure 4: The results of Secure@1 from each static analyzer and Pass@1 on SecCodePLT+. The results are separated by a dashed line. CL, BR, and BT represent CodeQL, Bearer, and Bandit, respectively. "Case" and "Task" indicate the Pass@1 scores calculated at the test case level (considering percentage of passed unit tests) and the task level (passing all unit tests).
  • Figure 5: CWE vulnerabilities identified by the three static analyzers for Qwen2.5-Coder enhanced by SafeCoder on the BigCodeBench dataset.
  • ...and 4 more figures
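The captions of Figures 3 and 4 distinguish Pass@1 at the test case level from Pass@1 at the task level. A minimal sketch of the difference follows, assuming one generated sample per task and assuming the case-level score averages each task's fraction of passed unit tests; the paper's exact aggregation may differ. The function names and the toy results are hypothetical.

```python
def pass_at_1_case_level(results):
    """Mean, over tasks, of the fraction of unit tests passed (assumed aggregation)."""
    return sum(sum(task) / len(task) for task in results) / len(results)


def pass_at_1_task_level(results):
    """Fraction of tasks whose generated code passes all of its unit tests."""
    return sum(1 for task in results if all(task)) / len(results)


# One inner list per task; each entry is the outcome of one unit test (toy data).
results = [
    [True, True, True, True],      # passes everything
    [True, True, False, False],    # half of the tests pass
    [False, False, False, False],  # nothing passes
    [True, True, True, False],     # one failing test
]

case_score = pass_at_1_case_level(results)
task_score = pass_at_1_task_level(results)
print(case_score, task_score)
```

Under these assumptions the case-level score gives partial credit and is never lower than the task-level score, so reporting both shows how much of the measured functionality comes from code that is only partly correct.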