A case study on the transformative potential of AI in software engineering on LeetCode and ChatGPT

Manuel Merkel, Jens Dörpinghaus

TL;DR

This study conducts a large-scale comparison between AI-generated and human-written Python solutions to LeetCode problems, using GPT-4o and LeetCode submissions as the testbed. It applies web scraping and data mining to assemble 2,321 problems, with 2,086 valid AI solutions and 57,238 human solutions, and evaluates four software-quality metrics via SonarQube and LeetCode data: code quality (code smells per LOC), code understandability (cognitive complexity per LOC), time behaviour (runtime rank), and resource utilisation (memory usage rank). The results show that GPT-4o produces code with significantly fewer code smells and lower cognitive complexity, and it achieves faster runtimes, though memory usage does not show a clear GenAI advantage. These findings highlight GenAI's potential to augment software engineering tasks at scale while underscoring limitations in generalisation and concerns about data contamination, and they provide a replicable framework and dataset for future cross-language and cross-GenAI investigations.

Abstract

The recent surge in generative artificial intelligence (GenAI) has the potential to bring about transformative changes across a range of sectors, including software engineering and education. As GenAI tools such as OpenAI's ChatGPT are increasingly used in software engineering, it becomes imperative to understand the impact of these technologies on the software product. This study uses web scraping and data mining from LeetCode to compare the software quality of Python programs written by LeetCode users with that of programs generated by GPT-4o. Specifically, it addresses the question of whether GPT-4o produces software of higher quality than humans do. The findings indicate that GPT-4o does not impair code quality, understandability, or runtime when generating code on a limited scale. Indeed, the generated code exhibits significantly lower values across all three metrics than the user-written code. However, contrary to expectations, the generated code did not show significantly better memory usage than the user code. Furthermore, GPT-4o had difficulty generalising to problems that were not included in the training data set. This contribution presents a first large-scale study comparing generated code with human-written code on the LeetCode platform, based on multiple measures including code quality, code understandability, time behaviour and resource utilisation. All data is publicly available for further research.
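
The length-normalised metrics named above (code smells per LOC, cognitive complexity per LOC) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Solution` record, its field names, the `per_loc` and `median_per_loc` helpers, and the use of the median as the summary statistic are all assumptions made for illustration, and the sample records are invented rather than taken from the study's dataset.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Solution:
    """One analysed solution; all field names are hypothetical."""
    source: str                # "genai" or "human"
    loc: int                   # lines of code in the solution
    code_smells: int           # raw count from a static analyser
    cognitive_complexity: int  # raw score from a static analyser


def per_loc(count: int, loc: int) -> float:
    """Normalise a raw count by solution length."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return count / loc


def median_per_loc(solutions, source, field):
    """Median of a per-LOC metric over all solutions from one source."""
    values = [
        per_loc(getattr(s, field), s.loc)
        for s in solutions
        if s.source == source
    ]
    return median(values)


# Invented records for illustration only, not data from the study.
samples = [
    Solution("genai", 10, 1, 2),
    Solution("genai", 20, 0, 4),
    Solution("human", 10, 2, 5),
    Solution("human", 25, 5, 10),
]

genai_smells = median_per_loc(samples, "genai", "code_smells")
human_smells = median_per_loc(samples, "human", "code_smells")
print(f"code smells per LOC: genai={genai_smells:.3f}, human={human_smells:.3f}")
```

Dividing by LOC keeps longer solutions from looking worse merely because they contain more lines; whether a difference between the two groups is statistically significant is a separate question that this sketch does not address.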

Paper Structure (37 sections, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Research process of web scraping and data mining
  • Figure 2: Distribution of problems across categories
  • Figure 3: Example prompts for OpenAI API
  • Figure 4: Heatmap of questions by id and retries until the solution is accepted
  • Figure 5: Boxplot of code smells per kLOC of the generated solutions and the user solutions
  • ...and 3 more figures