Investigating The Smells of LLM Generated Code

Debalina Ghosh Paul, Hong Zhu, Ian Bayley

TL;DR

This work proposes a scenario-based, code-smell-driven framework for assessing the quality of LLM-generated Java code against human-written baselines from ScenEval. It uses an automated Morphy-based test system to run large-scale experiments on four LLMs (Gemini Pro, Falcon, ChatGPT, Codex) and analyzes the smells detected by PMD, Checkstyle, and DesigniteJava to reveal systematic quality gaps. The results show that LLM-generated code has substantially more smells than human-written code: the average increase is about 63% overall, and it is larger for more complex tasks and for more advanced topics. These findings identify target areas for improving LLM code generation and demonstrate the value of automated smell-based evaluation in guiding quality improvement for AI-assisted programming.

Abstract

Context: Large Language Models (LLMs) are increasingly being used to generate program code. Much research has been reported on the functional correctness of generated code, but far less on its quality. Objectives: In this study, we propose a scenario-based method of evaluating the quality of LLM-generated code in order to identify the weakest scenarios, that is, those in which the quality of LLM-generated code most needs improvement. Methods: The method measures code smells, an important indicator of code quality, and compares them with a baseline formed from reference solutions of professionally written code. The test dataset is divided into subsets according to the topic of the code and the complexity of the coding task, so that each subset represents a different scenario of using LLMs for code generation. We also present an automated test system for this purpose and report experiments with Java programs generated in response to prompts given to four state-of-the-art LLMs: Gemini Pro, ChatGPT, Codex, and Falcon. Results: We find that LLM-generated code has a higher incidence of code smells than the reference solutions. Falcon showed the smallest smell increase, at 42.28%, followed by Gemini Pro (62.07%), ChatGPT (65.05%), and finally Codex (84.97%). The average smell increase across all LLMs was 63.34%, comprising 73.35% for implementation smells and 21.42% for design smells. We also found that the increase in code smells is greater for more complex coding tasks and for more advanced topics, such as those involving object-oriented concepts. Conclusion: In terms of code smells, the LLMs' performance across coding-task complexities and topics is highly correlated with the quality of human-written code in the corresponding scenarios. However, the quality of LLM-generated code is noticeably poorer than that of human-written code.
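The comparison described in the abstract, smell counts for generated code measured against a reference baseline and broken down by scenario, can be illustrated with a minimal sketch. The formula below (percentage increase over the baseline count) is an assumption about how a "smell increase" might be computed; this page does not reproduce the paper's definitions. The function names, the record layout, and the sample numbers are all hypothetical.

```python
from collections import defaultdict


def smell_increase_rate(llm_smells, baseline_smells):
    """Percentage increase of the LLM smell count over the baseline count.

    Assumed formula (not taken from the paper):
        100 * (llm - baseline) / baseline
    """
    if baseline_smells <= 0:
        raise ValueError("baseline smell count must be positive")
    return 100.0 * (llm_smells - baseline_smells) / baseline_smells


def increase_by_scenario(records):
    """Aggregate smell counts per scenario, then compute the increase rate.

    `records` is an iterable of (scenario, baseline_smells, llm_smells)
    tuples, one per coding task. A scenario could be a topic or a
    complexity band; the labels used here are hypothetical.
    """
    totals = defaultdict(lambda: [0, 0])  # scenario -> [baseline, llm]
    for scenario, baseline, llm in records:
        totals[scenario][0] += baseline
        totals[scenario][1] += llm
    return {
        scenario: smell_increase_rate(llm, baseline)
        for scenario, (baseline, llm) in totals.items()
    }


# Hypothetical per-task smell counts, for illustration only.
sample = [
    ("simple", 10, 12),
    ("simple", 10, 13),
    ("complex", 20, 35),
]
rates = increase_by_scenario(sample)
```

With these made-up counts, `rates` maps "simple" to 25.0 and "complex" to 75.0. The sketch sums counts within a scenario before dividing rather than averaging per-task percentages; that is a design choice of the sketch, not something this page confirms. It may be worth noting that the reported overall average (63.34%) differs slightly from the plain mean of the four per-model figures (about 63.59%), which would be consistent with some form of aggregation before averaging, though how the figure was computed is not stated here.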

Paper Structure

This paper contains 36 sections, 3 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Structure of The Test System
  • Figure 2: The Experiment Setup
  • Figure 3: All Code Smells Detected on the Whole Test Dataset
  • Figure 4: Variation of Code Smells by Complexity
  • Figure 5: Increase Rates of Code Smells by Complexity
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3