A Performance Study of LLM-Generated Code on Leetcode

Tristan Coignion, Clément Quinton, Romain Rouvoy

TL;DR

This research introduces a novel method for measuring and comparing the speed of LLM-generated code, revealing that LLMs produce code with comparable performance, irrespective of the adopted LLM.

Abstract

This study evaluates the efficiency of code generation by Large Language Models (LLMs) and measures their performance against human-crafted solutions using a dataset from Leetcode. We compare 18 LLMs, considering factors such as model temperature and success rate, and their impact on code performance. This research introduces a novel method for measuring and comparing the speed of LLM-generated code, revealing that LLMs produce code with comparable performance, irrespective of the adopted LLM. We also find that LLMs are capable of generating code that is, on average, more efficient than the code written by humans. The paper further discusses the use of Leetcode as a benchmarking dataset, the limitations imposed by potential data contamination, and the platform's measurement reliability. We believe that our findings contribute to a better understanding of LLM capabilities in code generation and set the stage for future optimizations in the field.

Paper Structure (25 sections, 1 equation, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Example of a problem's input prompt, generated code, and benchmarking code.
  • Figure 2: Average pass@1 of the evaluated LLMs for every difficulty and dataset, with 95% confidence interval (higher is better)
  • Figure 3: Coefficient of variation of the time measured by Leetcode and locally for every problem using canonical solutions
  • Figure 4: Scatter plot of the measurements taken by Leetcode and locally for every generation for the problem "Difference between element sum and digit sum of an array". Orange points are repeated measurements of the same canonical solution and serve as visual references for the measurement error
  • Figure 5: Scatter plot of the LLMs' rank and the date they were tested on Leetcode. The two models in red are the same model tested on different dates
  • ...and 3 more figures