How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark

Ruizhong Qiu; Weiliang Will Zeng; James Ezick; Christopher Lott; Hanghang Tong

TL;DR

This work introduces ENAMEL, a rigorous benchmark for evaluating the efficiency of LLM‑generated code. It defines a novel efficiency metric, $eff@k$, that generalizes pass@k to capture both correctness and execution efficiency under right‑censoring, and provides an unbiased, variance‑reduced estimator via Rao–Blackwellization. By leveraging expert‑written reference solutions and strong test case generators, ENAMEL sets a high standard for efficiency evaluations and yields a nuanced picture of current LLM capabilities across 30 models. The empirical study reveals that state‑of‑the‑art models lag behind expert‑level efficiency, highlighting a critical gap between code correctness and optimization, with implications for deployment efficiency and future model development.

Abstract

The emergence of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Advancement of LLM-based program synthesis calls for a thorough evaluation of LLM-generated code. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, an important measure of code quality, has been overlooked in existing evaluations. In this work, we develop ENAMEL (EfficieNcy AutoMatic EvaLuator), a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code. Firstly, we propose a new efficiency metric called eff@k, which generalizes the pass@k metric from correctness to efficiency and appropriately handles right-censored execution time. Furthermore, we derive an unbiased and variance-reduced estimator of eff@k via Rao–Blackwellization; we also provide a numerically stable implementation for the new estimator. Secondly, to set a high standard for efficiency evaluation, we employ a human expert to design the best algorithms and implementations as our reference solutions of efficiency, many of which are much more efficient than the existing canonical solutions in HumanEval and HumanEval+. Moreover, to ensure a rigorous evaluation, we employ a human expert to curate strong test case generators that filter out wrong code and differentiate suboptimal algorithms. An extensive study across 30 popular LLMs using our benchmark ENAMEL shows that LLMs still fall short of generating expert-level efficient code. Using two subsets of our problem set, we demonstrate that this deficiency arises because current LLMs struggle to design advanced algorithms and are barely aware of implementation optimization. Our benchmark is publicly available at https://github.com/q-rz/enamel .
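To make the idea of generalizing pass@k concrete, the sketch below estimates the expected best score among k samples from n ≥ k scored samples. It averages the maximum over all size-k subsets, which is a standard unbiased estimator for this quantity; that it matches the paper's eff@k estimator in every detail is an assumption. The function names and the example scores are hypothetical, and each score is assumed to be already computed upstream (for instance, 0 for a wrong or timed-out sample).

```python
from itertools import combinations


def expected_max_of_k(scores, k):
    """Estimate E[max of k i.i.d. scores] from n >= k observed scores.

    Equals the average of max(S) over all size-k subsets S, computed via
    order statistics: the r-th smallest score is the subset maximum in
    C(r-1, k-1) of the C(n, k) subsets.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(scores)
    # Weight of the largest score: C(n-1, k-1) / C(n, k) = k / n.
    weight = k / n
    total = 0.0
    for r in range(n, k - 1, -1):
        total += weight * ordered[r - 1]
        if r > k:
            # C(r-2, k-1) / C(r-1, k-1) = (r - k) / (r - 1)
            weight *= (r - k) / (r - 1)
    return total


def brute_force(scores, k):
    """Reference value: enumerate every size-k subset (small n only)."""
    subsets = list(combinations(scores, k))
    return sum(max(s) for s in subsets) / len(subsets)


if __name__ == "__main__":
    demo_scores = [0.0, 0.2, 0.5, 0.9]  # hypothetical efficiency scores
    print(expected_max_of_k(demo_scores, 2), brute_force(demo_scores, 2))
```

Running the weights downward from the largest score avoids forming large binomial coefficients, which is one simple way to keep the computation well-behaved for larger n. With scores restricted to 0 and 1, the same function reduces to the familiar pass@k estimate 1 − C(n−c, k)/C(n, k), where c is the number of correct samples.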

Paper Structure (37 sections, 1 theorem, 11 equations, 2 figures, 15 tables, 1 algorithm)

Introduction
Evaluation framework
Level-based evaluation
Efficiency score of a code sample
Efficiency metric for an LLM
Benchmark development
Problem selection
Efficient reference solutions
Strong test case generators
Evaluation
Main results & analysis
Analysis on algorithm design & implementation optimization
Distribution of problem difficulties
Related work
Conclusion
...and 22 more sections

Key Result

Theorem 1

Suppose that problem $i$ has time limit $T_i<\infty$ and reference execution times $t^*_{i,l,m}<T_i$. Under the randomness of code generation and execution, for $n\ge k$, we have:

Figures (2)

Figure 1: Illustration of our ENAMEL framework with HumanEval problem #55 (computing the $n$-th Fibonacci number). Our level-based evaluation clearly differentiates the three algorithms: (i) a naïve algorithm that needs $2^{\Theta(n)}$ recursions, (ii) a dynamic programming algorithm that needs $\Theta(n)$ iterations, and (iii) an efficient doubling algorithm that needs only $\Theta(\log n)$ iterations.
Figure 2: Distribution of problem difficulties (best viewed in color). High pass$_{i}$@1 but low eff$_{i}$@1 means problem $i$ has a seemingly easy task but a non-trivial efficient algorithm / implementation.
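The three algorithm classes named in the Figure 1 caption can be sketched as follows. This is an illustrative sketch, not the benchmark's reference solution; the function names are hypothetical, and the indexing F(0) = 0, F(1) = 1 is assumed.

```python
def fib_naive(n):
    """Naive recursion: exponentially many calls in n."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


def fib_dp(n):
    """Bottom-up dynamic programming: n loop iterations."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_doubling(n):
    """Fast doubling: recursion depth logarithmic in n."""
    def pair(m):
        # Returns (F(m), F(m + 1)).
        if m == 0:
            return 0, 1
        a, b = pair(m // 2)      # a = F(h), b = F(h + 1), h = m // 2
        c = a * (2 * b - a)      # F(2h)
        d = a * a + b * b        # F(2h + 1)
        return (d, c + d) if m % 2 else (c, d)
    return pair(n)[0]


if __name__ == "__main__":
    print(fib_naive(10), fib_dp(10), fib_doubling(10))
```

All three return the same values, so a correctness-only test cannot tell them apart; only inputs large enough to separate their running times can, which is the point of evaluating at several difficulty levels.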

Theorems & Definitions (1)

Theorem 1
