LLMSecCode: Evaluating Large Language Models for Secure Coding

Anton Rydén; Erik Näslund; Elad Michael Schiller; Magnus Almgren

TL;DR

LLMSecCode tackles the challenge of objectively benchmarking Large Language Models (LLMs) for secure coding by proposing an open-source evaluation framework that jointly covers Automated Program Repair (APR), Code Generation (CG), and Secure Coding (SC). It integrates diverse datasets (e.g., QuixBugs, HumanEval, Vul4J/VJBench, SecurityEval, CyberSecEval) and static analyzers (CodeQL, Bandit) within a configurable, repeatable workflow that includes model templates, prompt structures, and correction chains. Empirical results show that model performance varies by 10% with parameter choices (top-p, temperature) and by 9% with prompt design, while comparisons with external benchmarks reveal a modest difference of about 5%, indicating reasonable alignment and opportunities for standardization. The framework demonstrates that stronger CG/APR performance does not necessarily yield better SC performance, highlighting the need for targeted security-focused evaluation and dataset diversity. LLMSecCode thus provides a practical foundation for researchers and developers to benchmark, reproduce, and advance the development of models for secure coding in a standardized, transparent manner.
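To make the role of the two sampling parameters concrete, the following is a minimal sketch of how temperature and top-p reshape a next-token distribution, which is one reason results can shift between runs with different settings. It is not taken from LLMSecCode: the function names and the toy logits are assumptions chosen for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; greedy decoding covers the limit case")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def greedy_choice(logits):
    """Greedy decoding: always pick the single most likely token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
base_probs = softmax_with_temperature(toy_logits, 1.0)
sharp_probs = softmax_with_temperature(toy_logits, 0.5)
nucleus_probs = top_p_filter(base_probs, 0.9)
```

With these toy values, lowering the temperature concentrates probability on the first token, and a top-p of 0.9 removes the least likely token from the candidate set entirely. Greedy decoding ignores both parameters and always returns the first token.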

Abstract

The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection process of LLMs that are suitable for facilitating Secure Coding (SC). This raises challenging research questions, such as (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How to attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments. When varying parameters and prompts, we find a 10% and 9% difference in performance, respectively. We also compare some results to reliable external actors, where our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.

Paper Structure (57 sections, 6 figures, 11 tables)

Introduction
Code Generation (CG)
Automated Program Repair (APR)
Research Questions
Problem Description
Ethical Considerations
Our Contribution
Related Work
APR experiments
CG experiments
Background
Large Language Models
Static Analyzers
CodeQL
Bandit
...and 42 more sections

Figures (6)

Figure 1: Correction chain prompt example.
Figure 2: Rough flow chart of LLMSecCode.
Figure 3: Rough flow chart of the "Evaluate model on the single dataset" component of LLMSecCode.
Figure 4: HumanEval and QuixBugs results, generated with greedy decoding and a maximum of 400 new tokens for HumanEval and 1000 for QuixBugs. The figure is sorted from best to worst performance on HumanEval.
Figure 5: CyberSecEval Instruct, SecurityEval, Vul4J and VJBench (llm-vul) results, generated with greedy decoding and a maximum of 800 new tokens for CyberSecEval Instruct, 400 for SecurityEval, and 2000 for llm-vul. The graph uses the same model order as Figure 4.
...and 1 more figure
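
The generation settings stated in the captions of Figures 4 and 5 can be collected in one place, as in the sketch below. The `GenerationConfig` class and the `FIGURE_CONFIGS` name are hypothetical and do not reflect LLMSecCode's actual configuration format; only the dataset names, token limits, and the use of greedy decoding come from the captions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical container for per-dataset generation settings."""
    dataset: str
    max_new_tokens: int
    do_sample: bool = False  # False means greedy decoding, as stated in both captions


# Token limits as reported in the captions of Figures 4 and 5.
FIGURE_CONFIGS = {
    cfg.dataset: cfg
    for cfg in (
        GenerationConfig("HumanEval", 400),
        GenerationConfig("QuixBugs", 1000),
        GenerationConfig("CyberSecEval Instruct", 800),
        GenerationConfig("SecurityEval", 400),
        GenerationConfig("llm-vul", 2000),  # covers Vul4J and VJBench
    )
}
```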
