
Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis

Mohammed Kharma, Soohyeon Choi, Mohammed AlKhanafseh, David Mohaisen

TL;DR

This study evaluates the security and quality of LLM-generated code across four languages (Python, Java, C, C++) using a 200-task, six-category dataset and five LLMs. It combines manual semantic evaluation with static analysis (SonarQube) to quantify compilation success, correctness, reliability, maintainability, and security. The results reveal substantial language-dependent gaps and occasional underuse of modern language features, such as those introduced in Java 17. Python and Java generally yield higher compilability and more favorable security profiles than C and C++, which exhibit more memory- and cryptography-related vulnerabilities, underscoring language-specific risks in AI-generated code. The work outlines practical implications for safer AI-assisted software development and suggests targeted improvements in model training and prompting so that models better capture language-specific best practices.
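
As a hedged illustration of the memory-related weakness class mentioned above (not an example taken from the paper's dataset or results), the sketch below contrasts an unbounded `strcpy` into a fixed-size buffer, a classic buffer-overflow pattern (CWE-120), with a bounded copy through `snprintf`. The helper name `copy_bounded` is hypothetical and chosen only for this illustration.

```c
#include <stdio.h>

/* Hypothetical unsafe pattern, shown only in this comment:
 *   char buf[8];
 *   strcpy(buf, user_input);   // no bounds check: overflows buf when
 *                              // user_input has 8 or more characters
 */

/* Bounded alternative (illustrative helper, not from the paper):
 * writes at most dst_size - 1 characters plus a terminating NUL.
 * Returns 1 if the source did not fit (or on an encoding error), else 0. */
int copy_bounded(char *dst, size_t dst_size, const char *src) {
    /* snprintf returns the length it would have written without truncation */
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed < 0 || (size_t)needed >= dst_size;
}
```

For example, copying `"overflowing input"` into an 8-byte buffer stores `"overflo"` and reports truncation instead of writing past the end of the buffer.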

Abstract

Artificial Intelligence (AI)-driven code generation tools are increasingly used throughout the software development lifecycle to accelerate coding tasks. However, the security of code generated by Large Language Models (LLMs) remains underexplored, and existing studies reveal various risks and weaknesses. This paper analyzes the security of LLM-generated code across different programming languages. We introduce a dataset of 200 tasks grouped into six categories to evaluate how well LLMs generate secure and maintainable code. Our results show that while LLMs can automate code creation, their security effectiveness varies by language. Many models fail to use modern security features introduced in recent compiler and toolkit updates, such as Java 17. Moreover, outdated methods remain common, particularly in C++. These findings highlight the need to advance LLMs so that they produce more secure, higher-quality code and incorporate emerging best practices in each programming language.

Paper Structure

This paper contains 19 sections and 10 tables.