Performance Review on LLM for solving leetcode problems
Lun Wang, Chuanqi Shi, Shaoshui Du, Yiyi Tao, Yixian Shen, Hang Zheng, Yanxin Shen, Xinyu Qiu
TL;DR
The paper investigates how effectively Large Language Models generate code to solve LeetCode algorithm problems, addressing the gap in evaluating runtime efficiency alongside correctness. It uses a large-scale experimental pipeline with data collection, multi-model code generation across varied temperatures, automated LeetCode submissions, and pass@k as well as runtime metrics. Key contributions include a comparative performance analysis of 18 LLMs on 204 LeetCode problems, a direct human-vs-LLM comparison using runtime percentile ranks, and an assessment of LeetCode as a research dataset with its strengths and limitations. The findings show that top models achieve very high correctness under controlled settings, yet substantial gaps remain relative to human coding, particularly in efficiency, informing the design of automated programming assistants and benchmarking practices.
Abstract
This paper presents a comprehensive performance evaluation of Large Language Models (LLMs) in solving programming challenges from LeetCode, a widely used platform for algorithm practice and technical interviews. We began by crawling the LeetCode website to collect a diverse set of problems spanning various difficulty levels and topics. Using this dataset, we generated solutions with multiple LLMs, including GPT-4 and GPT-3.5-turbo (ChatGPT-turbo). The generated solutions were systematically evaluated for correctness and efficiency. We employed the pass@k metric to assess the success rate within a given number of attempts, and we analyzed the runtime performance of the solutions. Our results highlight the strengths and limitations of current LLMs [10] in code generation and problem-solving tasks, providing insights into their potential applications and areas for improvement in automated programming assistance.
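To make the pass@k metric mentioned above concrete, the following is a minimal sketch of how it is commonly computed for a single problem. It assumes the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n solutions are sampled and c of them pass all tests; the abstract does not state which estimator the authors used, and the function name `pass_at_k` and its parameters are illustrative rather than taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed unbiased estimator).

    n: total number of generated solutions for the problem
    c: number of those solutions that were accepted
    k: number of attempts allowed
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset
    # contains at least one accepted solution.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without
    # replacement are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples, 3 accepted.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

A benchmark-level score would then typically be the mean of this per-problem value over all problems, which is again an assumption about the aggregation rather than a detail given here.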
