Do Prompt Patterns Affect Code Quality? A First Empirical Assessment of ChatGPT-Generated Code
Antonio Della Porta, Stefano Lambiase, Fabio Palomba
TL;DR
The paper examines whether the prompt patterns used to request code from LLMs affect the quality of the resulting code in terms of maintainability, reliability, and security. It conducts an empirical study over a refined version of the DevGPT dataset, classifying prompts into patterns (Zero-shot, Few-shot, Chain-of-Thought, Personas) and assessing the generated code with SonarQube, using Kruskal-Wallis tests to detect differences. The main finding is that there are no statistically significant differences in quality metrics across prompt patterns ($p>0.05$ for all), suggesting that prompt structure alone has limited impact on code quality in this setting. The work contributes an improved DevGPT dataset, an automated prompt-pattern classification approach, and an open appendix for reproducibility. It tells practitioners that simple prompting can often yield satisfactory results, and it motivates future research into richer metrics and more diverse tasks.
Abstract
Large Language Models (LLMs) have rapidly transformed software development, especially in code generation. However, their inconsistent performance, prone to hallucinations and quality issues, complicates program comprehension and hinders maintainability. Research indicates that prompt engineering, the practice of designing inputs to direct LLMs toward generating relevant outputs, may help address these challenges. In this regard, researchers have introduced prompt patterns, structured templates intended to guide users in formulating their requests. However, the influence of prompt patterns on code quality has yet to be thoroughly investigated. An improved understanding of this relationship would be essential to advancing our collective knowledge on how to effectively use LLMs for code generation, thereby enhancing their understandability in contemporary software development. This paper empirically investigates the impact of prompt patterns on code quality, specifically maintainability, security, and reliability, using the DevGPT dataset. Results show that Zero-Shot prompting is most common, followed by Zero-Shot with Chain-of-Thought and Few-Shot. Analysis of 7583 code files across quality metrics revealed minimal issues, with Kruskal-Wallis tests indicating no significant differences among patterns, suggesting that prompt structure may not substantially impact these quality metrics in ChatGPT-assisted code generation.
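To make the statistical step concrete, the following is a minimal sketch of a Kruskal-Wallis comparison across prompt patterns. It is not the authors' analysis script: the per-file issue counts are invented toy values, the names `kruskal_wallis_h`, `toy_issue_counts`, and `CHI2_CRIT_05` are hypothetical, and the decision uses the standard chi-square critical values at $\alpha = 0.05$ rather than an exact p-value.

```python
def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    # Pool all observations, remembering which group each one came from.
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(pooled)

    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    h = 12 / (n * (n + 1)) * sum(r * r / len(s) for r, s in zip(rank_sums, groups)) - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    # If every value is identical the statistic is undefined; report 0.0 here.
    return h / correction if correction > 0 else 0.0


# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom (k - 1).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}

# Hypothetical issue counts per generated file, grouped by prompt pattern.
# These numbers are illustrative only and are NOT taken from the paper.
toy_issue_counts = {
    "zero_shot": [0, 1, 0, 2, 0, 1],
    "few_shot": [0, 0, 1, 3, 0, 1],
    "chain_of_thought": [1, 0, 0, 2, 1, 0],
    "personas": [0, 2, 1, 0, 0, 1],
}

h_stat = kruskal_wallis_h(list(toy_issue_counts.values()))
df = len(toy_issue_counts) - 1
verdict = "significant" if h_stat > CHI2_CRIT_05[df] else "not significant"
print(f"H = {h_stat:.4f}, df = {df}: difference is {verdict} at alpha = 0.05")
```

A rank-based test suits this setting because issue counts are skewed and heavily tied at zero, so a normality assumption would be hard to justify. In practice a library routine such as `scipy.stats.kruskal` would also return the p-value directly.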
