On Evaluating the Efficiency of Source Code Generated by LLMs

Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, Vincent Ng

TL;DR

This work investigates the runtime efficiency of code generated by LLMs beyond mere correctness. It introduces an empirical framework using HumanEval, MBPP, and a LeetCode-based benchmark (LeetCodeEval) with a Gem5-based evaluation to measure execution performance and assess how prompting strategies influence efficiency. Key findings show that correctness and efficiency are not tightly correlated, model size alone does not guarantee faster code, and structured prompting, including chain-of-thought approaches, can boost efficiency on complex tasks. The study provides practical guidance for model selection and prompt design to prioritize execution efficiency in LLM-assisted coding, and lays groundwork for future efficiency-focused prompt methodologies.

Abstract

Recent years have seen the remarkable capabilities of large language models (LLMs) for code generation. Unlike existing work that evaluates the correctness of the code generated by LLMs, we propose to also evaluate its efficiency. More efficient code improves the performance of programs and software written with LLM-assisted programming. First, we evaluate the efficiency of the code generated by LLMs on two benchmarks, HumanEval and MBPP. Then, we choose a set of programming problems from the online judge platform LeetCode to conduct a more difficult evaluation. Finally, we explore several prompts that would enable LLMs to generate more efficient code.

Paper Structure (12 sections, 3 figures, 4 tables)

This paper contains 12 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Code snippets extracted from the LLM-generated code for HumanEval.
  • Figure 2: The prompt template for LeetCodeEval.
  • Figure 3: Three prompt methods.