GitHub Copilot: the perfect Code compLeeter?

Ilja Siroš; Dave Singelée; Bart Preneel

TL;DR

This paper presents a large-scale automated evaluation of code generated by GitHub Copilot for LeetCode problems in four languages (Java, C++, Python3, Rust). An end-to-end pipeline collects, submits, and assesses Copilot-generated solutions, amounting to over 50000 submissions, and analyzes reliability, correctness, and time/memory efficiency relative to human submissions. The results show language-dependent performance (Java and C++ outperform Python3 and Rust) and reveal that Copilot's ranking of its suggestions does not always put the best solution first. Topic-related trends also emerge: Bucket Sort problems are solved most successfully and Tree problems are the hardest. The findings suggest that Copilot can produce more efficient code than the average human, while highlighting limitations in Python3 and the importance of exploring multiple generated solutions. These insights inform practical use of Copilot and guide future improvements across languages and problem contexts.

Abstract

This paper evaluates the quality of code generated by GitHub Copilot on the LeetCode problem set using a custom automated framework. We evaluate Copilot for four programming languages: Java, C++, Python3 and Rust. We assess Copilot's reliability in the code generation stage, the correctness of the generated code, and how that correctness depends on the programming language, the problem's difficulty level and the problem's topic. In addition, we evaluate the time and memory efficiency of the code and compare it to average human results. In total, we generate solutions for 1760 problems for each programming language and evaluate all of Copilot's suggestions for each problem, resulting in over 50000 submissions to LeetCode spread over a 2-month period. We found that Copilot successfully solved most of the problems. However, Copilot was more successful in generating code in Java and C++ than in Python3 and Rust. Moreover, in the case of Python3, Copilot proved to be rather unreliable in the code generation phase. We also discovered that Copilot's top-ranked suggestions are not always the best. In addition, we analysed how the topic of the problem impacts the correctness rate. Finally, based on statistical information from LeetCode, we conclude that Copilot generates more efficient code than an average human.

Paper Structure (19 sections, 6 figures, 8 tables, 1 algorithm)

Introduction
Background and Related Work
Design
Evaluation flow
Get the list of all LeetCode problems
Request and parse problem content
Invoke Copilot and save generated solutions
Form a solution and submit to LeetCode
Check the submission result and save it
Results
How does the correctness of Copilot's solution depend on the programming language?
How does the correctness of Copilot's solution depend on the difficulty of the problem?
How does the correctness of Copilot's solution depend on its rank in the code generation phase?
How does the correctness of Copilot's solution depend on the topic of the problem?
Discussion
...and 4 more sections

Figures (6)

Figure 1: Example of a LeetCode problem. On the left part of the figure, the problem description is shown and some examples are mentioned together with the constraints imposed on the solution by LeetCode. The right part of the figure shows the solution template for the chosen programming language, containing the function that needs to be implemented.
Figure 2: The code of the problem before invoking Copilot. The problem's description is added as a comment above the code template to give Copilot a context to generate a solution. Before invoking Copilot, the mouse cursor is placed inside the function that we want Copilot to generate. In this example, it is line 28.
Figure 3: File with Copilot suggestions. Copilot provides multiple suggestions for the request to generate code. In this case, the number of generated suggestions is stated on line 1 and equals 10. These suggestions are ranked from the best to the worst by Copilot. In this paper, we refer to the first suggestion as Rank 0, the second as Rank 1 and so on. In this figure, only Rank 0 and Rank 1 suggestions are shown.
Figure 4: The head of the table with the submission results. The full table has over 50000 rows and is publicly available with the code.
Figure 5: An example of a problem which has an image in the description. The image is intended to help the reader understand the examples below it. In our research, Copilot had no access to information from the images, which may explain why Copilot performed worse on problems containing images in their description.
...and 1 more figures
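
The prompt layout described in the Figure 2 caption (problem description placed as a comment above the code template) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_prompt`, its parameters, and the sample description and template are hypothetical and chosen only for illustration.

```python
def build_prompt(description: str, template: str, comment_prefix: str = "# ") -> str:
    """Return the template with the description prepended as line comments.

    Hypothetical helper: comment_prefix would be "# " for Python3 and
    "// " for Java, C++ or Rust.
    """
    commented = "\n".join(
        # rstrip() avoids trailing whitespace on blank description lines
        (comment_prefix + line).rstrip()
        for line in description.splitlines()
    )
    # A blank line separates the commented description from the template.
    return commented + "\n\n" + template


# Illustrative inputs only; real LeetCode descriptions are longer.
description = "Return the sum of two integers.\n\nExample: a = 1, b = 2 gives 3."
template = "class Solution:\n    def add(self, a: int, b: int) -> int:\n        "
prompt = build_prompt(description, template)
print(prompt)
```

In a setup like the one the caption describes, the cursor would then be placed inside the function body at the end of this text before Copilot is invoked.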
