Comparing large language models and human programmers for generating programming code

Wenpin Hou; Zhicheng Ji

TL;DR

GPT-4 demonstrates strong capabilities in translating code between programming languages and in learning from past errors, and the computational efficiency of the code it generates is comparable to that of human programmers.

Abstract

We systematically evaluated the performance of seven large language models in generating programming code using various prompt strategies, programming languages, and task difficulties. GPT-4 substantially outperforms other large language models, including Gemini Ultra and Claude 2. The coding performance of GPT-4 varies considerably with different prompt strategies. In most LeetCode and GeeksforGeeks coding contests evaluated in this study, GPT-4 employing the optimal prompt strategy outperforms 85 percent of human participants. Additionally, GPT-4 demonstrates strong capabilities in translating code between different programming languages and in learning from past errors. The computational efficiency of the code generated by GPT-4 is comparable to that of human programmers. These results suggest that GPT-4 has the potential to serve as a reliable assistant in programming code generation and software development.

Paper Structure (21 sections, 5 figures)

Introduction
Results
Discussion
Methods
Acknowledgments
Author contributions
Competing interests
Data availability

Figures (5)

Figure 1: A schematic illustrating the prompt strategies that were developed and assessed in this study.
Figure 2: Five-attempt and one-attempt success rates for tasks with varying difficulty levels. Each row represents a different prompt strategy and a different LLM. The text within each bar displays the success rate, the number of programming tasks solved, and the total number of tasks. Rows are ordered according to the average success rates for both five attempts and one attempt across all tasks.
Figure 3: Comparing the coding performance of LLMs and human programmers. a, Percentile rank (x-axis) of LLMs (y-axis) for LeetCode and GeeksforGeeks contests. Each dot is a coding contest. b, Percentile rank of GPT-4 for LeetCode and GeeksforGeeks contests, shown in different colors. Each row represents a contest. The text in each bar shows the percentile rank of GPT-4, the rank of GPT-4, and the total number of participants. c, LLM success rates (y-axis) for LeetCode programming tasks, categorized by the proportion of human participants who successfully solved the task (x-axis). LLMs are represented by different colors.
Figure 4: The salvage rates (y-axis) increase with the number of attempts (x-axis) for GPT-4 using different prompt strategies (a) and for different LLMs (b). In b, the feedback CI prompt strategy was used for GPT-4, and the prompt strategies for other LLMs were the same as those in Figure 2.
Figure 5: Evaluation of GPT-4's abilities in translating across programming languages and the computational efficiency of the generated code. a, A schematic illustrates a proposed strategy to enhance the success rate of non-Python3 programming languages using translation. b, The success rate of translating Python3 code, generated from one-attempt feedback CI prompts, to other programming languages for tasks of varying difficulty levels. The top and bottom panels display results for when the original Python3 code is correct or incorrect, respectively. Since Python3 code generated by GPT-4 is correct for all easy tasks, translation is not evaluated for easy tasks when the Python3 code is incorrect. c, Running time and memory usage percentile for GPT-4 generated code compared to human programmers before and after optimization by GPT-4. A higher percentile represents lower running time and memory usage, indicating better computational efficiency.
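
The captions for Figures 2 and 3 report two simple quantities: a success rate (tasks solved out of total tasks) and a percentile rank derived from a contest rank and the number of participants. The sketch below shows one way these could be computed. The function names are hypothetical, and the percentile convention (share of participants ranked strictly below, with rank 1 as the best) is an assumption; the paper may define it differently.

```python
def success_rate(solved: int, total: int) -> float:
    """Fraction of tasks solved, as reported in the Figure 2 bar labels."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return solved / total


def percentile_rank(rank: int, participants: int) -> float:
    """Percentage of participants ranked strictly below the given rank.

    Assumed convention (not taken from the paper): rank 1 is the best
    position, and the participant count includes the ranked entry itself.
    """
    if not 1 <= rank <= participants:
        raise ValueError("rank must be between 1 and participants")
    return 100.0 * (participants - rank) / participants


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    print(success_rate(3, 4))
    print(percentile_rank(1500, 10000))
```

Under this convention, a percentile rank of 85 means the entry placed ahead of 85 percent of the participants in that contest.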