Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

Jiexin Wang; Xitong Luo; Liuwen Cao; Hongkui He; Hailin Huang; Jiayuan Xie; Adam Jatowt; Yi Cai

TL;DR

This work tackles the problem of security in AI-generated code by introducing CodeSecEval, a dataset of 180 Python samples spanning 44 vulnerability types, designed to enable precise automatic evaluation of code generation and repair. It evaluates seven state-of-the-art LLMs and shows that current models frequently overlook security during both generation and repair. Vulnerability-aware problem formulations and insecure code explanations significantly improve security outcomes, though their effectiveness varies by vulnerability type. The authors propose and validate strategies to mitigate vulnerabilities and emphasize the need for more robust training and deployment practices. Overall, CodeSecEval provides a practical benchmark to advance safer code generation and repair in real-world software engineering.

Abstract

Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training on unsanitized data from open-source repositories, such as GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap in comprehensively addressing their security features. In this work, we present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset designed to address 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models in two crucial tasks: code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair, resulting in vulnerable code. In response, we propose different strategies that leverage vulnerability-aware information and insecure code explanations to mitigate these security vulnerabilities. Furthermore, our findings highlight that certain vulnerability types particularly challenge model performance, influencing their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.

Paper Structure (15 sections, 1 equation, 4 figures, 4 tables)

Introduction
Related Work
Security Issue of LLMs
Datasets for code security
Study Design
CodeSecEval
Dataset Introduction
Dataset Construction
Assumptions for Vulnerability Mitigation in Code Generation and Code Repair
Experimental Setup
Designed Experiments
Tested Models
Metrics
Results Discussion
Conclusions And Future Work

Figures (4)

Figure 1: Illustrative examples of the CodeSecEval dataset, comprising two data instances from its two sub-datasets. Attributes displayed with a white background are the standard attributes of the CodeSecEval dataset. Attributes with a gray background are newly introduced; our investigation aims to validate whether they can effectively mitigate vulnerabilities, as discussed in Section 3.2.
Figure 2: The flowchart of the manual filtering process.
Figure 3: Code Generation performance results of GPT-4 across 14 vulnerability types on the SecEvalPlus sub-dataset, under two different experimental settings.
Figure 4: Code Repair performance results of GPT-4 across 14 vulnerability types on the SecEvalPlus sub-dataset, under two different experimental settings.
