LLMSecCode: Evaluating Large Language Models for Secure Coding

Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren

TL;DR

LLMSecCode tackles the challenge of objectively benchmarking Large Language Models for secure coding by proposing an open-source evaluation framework that jointly covers Automated Program Repair (APR), Code Generation (CG), and Secure Coding (SC). It integrates diverse datasets (e.g., QuixBugs, HumanEval, Vul4J/VJBench, SecurityEval, CyberSecEval) and static analyzers (CodeQL, Bandit) within a configurable, repeatable workflow that includes model templates, prompt structures, and correction chains. Empirical results show that model performance varies by about 10% with parameter choices (top-p, temperature) and by about 9% with prompt design, while comparisons with external benchmarks reveal a modest difference of about 5%, indicating reasonable alignment and opportunities for standardization. The framework demonstrates that stronger CG/APR does not necessarily yield better SC, highlighting the need for targeted security-focused evaluation and dataset diversity. LLMSecCode thus provides a practical foundation for researchers and developers to benchmark, reproduce, and advance secure-model development in a standardized, transparent manner.

Abstract

The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection process of LLMs that are suitable for facilitating Secure Coding (SC). This raises challenging research questions, such as (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How to attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments. When varying parameters and prompts, we find a 10% and 9% difference in performance, respectively. We also compare some results to reliable external actors, where our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.

Paper Structure

This paper contains 57 sections, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Correction chain prompt example.
  • Figure 2: Rough flow chart of LLMSecCode.
  • Figure 3: Rough flow chart of the "Evaluate model on the single dataset" component of LLMSecCode.
  • Figure 4: HumanEval and QuixBugs results, generated with greedy decoding and a maximum of 400 new tokens for HumanEval and 1000 for QuixBugs. Models are sorted from best to worst performance on HumanEval.
  • Figure 5: CyberSecEval Instruct, SecurityEval, Vul4J and VJBench (llm-vul) results, generated with greedy decoding and a maximum of 800 new tokens for CyberSecEval Instruct, 400 for SecurityEval, and 2000 for llm-vul. Models appear in the same order as in Figure 4.
  • ...and 1 more figure
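
The decoding settings reported in the captions of Figures 4 and 5 can be collected into a small lookup table, which may help when trying to reproduce those runs. The following is a minimal sketch, not LLMSecCode's actual configuration format: the `GenerationSettings` class, the `CAPTION_SETTINGS` table, and the `settings_for` helper are hypothetical names introduced here for illustration. Only the dataset names, the token limits, and the use of greedy decoding come from the captions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container; not LLMSecCode's own configuration type."""

    max_new_tokens: int
    do_sample: bool = False  # False means greedy decoding, as in the captions


# Token budgets as stated in the captions of Figures 4 and 5.
CAPTION_SETTINGS = {
    "HumanEval": GenerationSettings(max_new_tokens=400),
    "QuixBugs": GenerationSettings(max_new_tokens=1000),
    "CyberSecEval Instruct": GenerationSettings(max_new_tokens=800),
    "SecurityEval": GenerationSettings(max_new_tokens=400),
    "llm-vul": GenerationSettings(max_new_tokens=2000),
}


def settings_for(dataset: str) -> GenerationSettings:
    """Return the caption-derived settings, failing loudly on unknown names."""
    try:
        return CAPTION_SETTINGS[dataset]
    except KeyError:
        raise KeyError(f"no caption-derived settings for dataset {dataset!r}") from None


if __name__ == "__main__":
    for name, settings in CAPTION_SETTINGS.items():
        print(f"{name}: max_new_tokens={settings.max_new_tokens}, greedy={not settings.do_sample}")
```

Keeping the limits in one immutable table makes it harder for a rerun to silently use a different token budget than the one behind the published figure.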