Comparing Human and LLM Generated Code: The Jury is Still Out!

Sherlock A. Licorish, Ansh Bajpai, Chetan Arora, Fanyu Wang, Kla Tantithamthavorn

TL;DR

The paper compares Python code produced by GPT-4 against human-written solutions across 72 programming tasks, evaluating coding standards, security, complexity, and functional correctness. It uses Pylint, Radon, Bandit, and Pytest to quantify quality along these dimensions, and uses zero-shot prompting to keep the comparison fair. Humans score higher on adherence to coding standards and documentation, while GPT-4 often passes more test cases but generates more complex code; notable security flaws appear in code from both sources. The findings support a cautious, collaborative approach in which LLMs augment human developers under rigorous review to maintain safety and maintainability, and the paper outlines a roadmap for future research across models and domains.

Abstract

Much is promised in relation to AI-supported software development. However, there has been limited evaluation effort in the research domain aimed at validating the true utility of such techniques, especially when compared to human coding outputs. We bridge this gap by using a benchmark dataset comprising 72 distinct software engineering tasks to compare the effectiveness of large language models (LLMs) and human programmers in producing Python software code. GPT-4 is used as a representative LLM; for the code generated by humans and by this LLM, we evaluate code quality and adherence to Python coding standards, code security and vulnerabilities, code complexity, and functional correctness. We use various static analysis tools, including Pylint, Radon and Bandit, together with test cases. Among the notable outcomes, results show that human-generated code recorded higher ratings for adhering to coding standards than GPT-4 code. We observe security flaws in code generated by both humans and GPT-4; code generated by humans shows a greater variety of problems, but GPT-4 code included more severe outliers. Our results show that although GPT-4 is capable of producing coding solutions, it frequently produces more complex code that may need more reworking to ensure maintainability. In contrast, a higher number of test cases passed for code generated by GPT-4 across a range of tasks than for code generated by humans. That said, GPT-4 frequently struggles with complex problem-solving that involves in-depth domain knowledge. This study highlights the potential utility of LLMs for supporting software development; however, tasks requiring comprehensive, innovative or unconventional solutions, and careful debugging and error correction, seem to be better handled by human programmers. We plot an agenda for the software engineering community.
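
To make the complexity dimension concrete: Radon reports cyclomatic complexity, which roughly counts the independent decision paths through a piece of code. The sketch below is a standard-library-only approximation of that idea, not Radon's implementation and not the authors' pipeline. The function name `approx_complexity` and the two toy solutions are hypothetical and are not drawn from the paper's dataset.

```python
import ast

# Node types that open an extra decision path. This is a simplified
# approximation of cyclomatic complexity, not Radon's exact counting rules.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def approx_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in `source`."""
    tree = ast.parse(source)
    score = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a or b or c" has three operands and adds two extra paths.
            score += len(node.values) - 1
    return score


# Two hypothetical solutions to the same toy task.
SIMPLE = """
def largest(xs):
    return max(xs)
"""

BRANCHY = """
def largest(xs):
    best = None
    for x in xs:
        if best is None or x > best:
            best = x
    return best
"""

if __name__ == "__main__":
    print("simple :", approx_complexity(SIMPLE))
    print("branchy:", approx_complexity(BRANCHY))
```

Both snippets solve the same task, yet the second accumulates a loop, a conditional, and a boolean operator, which is the kind of difference a complexity metric is meant to surface when comparing two solutions.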

Paper Structure

This paper contains 25 sections, 5 figures.

Figures (5)

  • Figure 1: Comparison of Human vs LLM Pylint Scores in box plots
  • Figure 2: Comparison of Human vs LLM Bandit Scores in a scatter plot
  • Figure 3: Comparison of Human vs LLM Bandit Scores in box plots
  • Figure 4: Comparison of Human vs LLM Radon Scores in box plots
  • Figure 6: Comparison of Human vs LLM Test Scores in box plots