LLMSecCode: Evaluating Large Language Models for Secure Coding
Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren
TL;DR
LLMSecCode addresses the challenge of objectively benchmarking Large Language Models (LLMs) for secure coding by proposing an open-source evaluation framework that jointly covers Automated Program Repair (APR), Code Generation (CG), and Secure Coding (SC). It integrates diverse datasets (e.g., QuixBugs, HumanEval, Vul4J/VJBench, SecurityEval, CyberSecEval) and static analyzers (CodeQL, Bandit) within a configurable, repeatable workflow that includes model templates, prompt structures, and correction chains. Empirical results show that model performance varies by about 10% with parameter choices (top-p, temperature) and by about 9% with prompt design, while results differ from those of external benchmarks by about 5%, indicating reasonable alignment and room for standardization. The framework also shows that stronger CG or APR performance does not necessarily yield better SC performance, highlighting the need for targeted, security-focused evaluation and dataset diversity. LLMSecCode thus provides a practical foundation for researchers and developers to benchmark, reproduce, and advance the development of models for secure coding in a standardized, transparent manner.
Abstract
The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the process of selecting LLMs that are suitable for facilitating Secure Coding (SC). This raises challenging research questions: (RQ1) Which functionality can streamline LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How can we attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess the SC capabilities of LLMs objectively. We validate the LLMSecCode implementation through experiments. When varying parameters and prompts, we find a 10% and 9% difference in performance, respectively. We also compare some of our results with those reported by reliable external actors and observe a 5% difference. We strive to keep our open-source framework easy to use and encourage its further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.
