CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray

TL;DR

CWEval addresses the critical need to evaluate LLM-generated code for both functionality and security, proposing an outcome-driven framework with two test oracles and a multilingual benchmark. It demonstrates that many models produce functionally correct but insecure code and reveals inconsistencies in prior evaluations that relied on static analysis. The study shows that larger models and security-focused prompts can improve security performance, but also uncovers pitfalls where certain fine-tuning strategies degrade overall functionality. The open-source CWEval and CWEval-bench enable reproducible, joint assessments of secure code generation with practical implications for model development and deployment.

Abstract

Large Language Models (LLMs) have significantly aided developers by generating code or assisting in writing it, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This poses considerable security risks when using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to address this need but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework assesses code functionality and security simultaneously, using high-quality task specifications and outcome-driven test oracles that provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation of LLM-generated code, overcoming previous benchmarks' shortcomings. Through our evaluations, CWEval reveals that a notable portion of the code produced by LLMs is functional but insecure, and shows serious inaccuracies in previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at: https://github.com/Co1lin/CWEval .

Paper Structure

This paper contains 15 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: (1) The CodeQL rule that checks for "incomplete URL substring sanitization (CWE-020)" looks for certain insecure sanitization methods, including in, startswith and endswith. This rule is used in SecurityEval [siddiq2022securityeval] for the coding task shown in Figure 2 (at the upper-right corner). (2) Two vulnerable implementations are successfully caught by this rule. (3) However, two other insecure implementations that differ only slightly are not reported and are therefore considered safe in previous evaluations (false negatives). (4) In addition, a secure implementation can be wrongly flagged as vulnerable (false positive).
  • Figure 2: Inspired by documentation about CWEs, CWEval consists of coding tasks with three components: specifications (2), reference implementations (3) and test oracles (4). Compared to previous benchmarks like SecurityEval [siddiq2022securityeval], our coding tasks are isolated from third-party dependencies as much as possible. Our specifications are more definite and ensure the presence of security-related semantics (a user-provided URL will be used for redirection, so proper sanitization is needed here). Test oracles for both functionality and security evaluation are included. For each task, the secure reference implementation passes all test oracles, while the functional but insecure reference implementation passes only the functionality test oracles and fails the security test oracles.
  • Figure 3: Evaluating LLMs on CWEval-bench. The best-performing results across all temperature settings are presented. Results of greedy decoding are denoted func@1* and func-sec@1*. Values of func@10 and func-sec@10 are labeled at the top; values of func@1* and func-sec@1* are labeled at the bottom.
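
The contrast drawn in Figures 1 and 2 between pattern-based static checks and outcome-driven test oracles can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the example domains, and the oracle inputs are hypothetical and are not taken from the CWEval artifact. The insecure variant uses the kind of substring check named in Figure 1; the oracles judge each implementation only by what it returns.

```python
from urllib.parse import urlparse


def redirect_target_insecure(target: str, domain: str) -> str:
    """Functional but insecure: substring check on the raw URL string."""
    if domain in target:  # matches "https://attack.com/example.com" too
        return target
    return f"https://{domain}"


def redirect_target_secure(target: str, domain: str) -> str:
    """Compare the parsed hostname against the allowed domain."""
    parsed = urlparse(target)
    host = parsed.hostname or ""
    # Accept the domain itself or a true subdomain, over https only.
    if parsed.scheme == "https" and (host == domain or host.endswith("." + domain)):
        return target
    return f"https://{domain}"


def functionality_oracle(impl) -> bool:
    """Legitimate URLs inside the domain must be returned unchanged."""
    benign = ["https://music.example.com/a", "https://example.com/x"]
    return all(impl(url, "example.com") == url for url in benign)


def security_oracle(impl) -> bool:
    """URLs outside the domain must fall back to the domain itself."""
    attacks = [
        "https://attack.com/example.com",   # domain appears only in the path
        "https://example.com.attack.com/",  # domain used as a prefix label
        "https://attackexample.com/",       # domain as a bare suffix
    ]
    return all(impl(url, "example.com") == "https://example.com" for url in attacks)


if __name__ == "__main__":
    for impl in (redirect_target_insecure, redirect_target_secure):
        print(impl.__name__, functionality_oracle(impl), security_oracle(impl))
```

In this sketch both implementations pass the functionality oracle, but only the hostname-based one passes the security oracle, mirroring the caption's description of a functional but insecure reference implementation. Because the oracles look only at observable behavior, they do not depend on which string method the code happens to use.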