An evaluation of LLM code generation capabilities through graded exercises

Álvaro Barbero Jiménez

TL;DR

A new evaluation of the performance of one state-of-the-art model (GPT4-o-mini) in solving curated coding challenges in 8 programming languages, obtained from Codewars, suggests that current evaluation methodologies might be overestimating the actual skill of Large Language Models for generating functional code.

Abstract

Large Language Models have shown prominent capabilities in generating functional code from natural language descriptions. However, a standardized way to evaluate these capabilities in an objective and unbiased manner is still to be found. In this paper we review the evaluation methods currently available for this purpose, and run a new evaluation of the performance of one state-of-the-art model (GPT4-o-mini) in solving curated coding challenges in 8 programming languages, obtained from Codewars, a software development community. Our analysis shows that the chance of success of the model has a positive correlation with the task difficulty, the popularity of the programming language being used and the time elapsed since the publication of the challenge. A further approximate explanatory analysis in terms of high-level features hints that 46.6% of the model performance could be attributed to task difficulty, 37.4% seems to be related to leakage of the challenge solutions into the model training set, and the remaining 16% depends on the programming language. These results suggest that current evaluation methodologies might be overestimating the actual skill of Large Language Models for generating functional code.

Paper Structure

This paper contains 16 sections, 1 equation, 13 figures, 1 table.

Figures (13)

  • Figure 1: Flow of kata completion by a user in Codewars. The user receives the information of the kata in the form of a description, a function header or template to follow in the solution, and a set of public unit tests. When the user proposes a solution, it is first checked against the public unit tests, and then against a larger set of hidden tests. If the proposed solution passes both checks, it is accepted and added to a repository of valid solutions.
  • Figure 2: Percentage of users who manage to complete a kata, once they decide to start it, according to its difficulty level, regardless of the programming language used.
  • Figure 3: Network of bots used to download kata information, generate solutions using an OpenAI model, and verify whether the solutions are correct.
  • Figure 4: Prompts used for the coding solutions generation.
  • Figure 5: Number of katas processed in this work, divided by programming language and difficulty (rank). Lower rank means higher difficulty. Some combinations of languages and rank have no katas available in Codewars.
  • ...and 8 more figures
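
The two-stage check described in the caption of Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the Codewars implementation or the paper's bot code: the names `passes` and `submit`, the representation of a test as an `(arguments, expected result)` pair, and the example kata are all assumptions made here for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed representation: a test is a pair (arguments, expected result).
TestCase = Tuple[tuple, object]


def passes(solution: Callable, tests: Iterable[TestCase]) -> bool:
    """Return True only if the solution returns the expected value on every test."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # A solution that raises on any test counts as failing.
            return False
    return True


def submit(
    solution: Callable,
    public_tests: List[TestCase],
    hidden_tests: List[TestCase],
    accepted: List[Callable],
) -> bool:
    """Hypothetical submission flow: public tests first, then the hidden tests."""
    if not passes(solution, public_tests):
        return False
    if not passes(solution, hidden_tests):
        return False
    # Both checks succeeded: store the solution among the valid ones.
    accepted.append(solution)
    return True


# Hypothetical kata: "return twice the input".
public_tests = [((1,), 2)]
hidden_tests = [((3,), 6), ((0,), 0)]
accepted: List[Callable] = []

correct = lambda x: 2 * x
overfitted = lambda x: 2 if x == 1 else 0  # passes the public test only

result_correct = submit(correct, public_tests, hidden_tests, accepted)
result_overfitted = submit(overfitted, public_tests, hidden_tests, accepted)
```

In this sketch the overfitted solution passes the public test but is rejected by the hidden tests, so only the correct solution ends up in `accepted`.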