EffiBench: Benchmarking the Efficiency of Automatically Generated Code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang

TL;DR

EffiBench introduces the first benchmark dedicated to evaluating the efficiency of code produced by large language models, addressing a key sustainability concern in AI-assisted software development. It curates 1,000 efficiency-critical LeetCode problems, pairs each with an executable canonical solution, and generates a large number of test cases per problem. On these, EffiBench measures execution time and memory usage with six metrics (ET, NET, MU, NMU, TMU, NTMU). The empirical evaluation across 42 open- and closed-source LLMs reveals substantial efficiency gaps between generated and canonical code, with GPT-4 showing the best closed-source performance yet still significantly slower and more memory-hungry on average. The paper also provides an extensible efficiency-testing framework and a Hugging Face Space leaderboard, supporting broader adoption and a stronger focus on efficiency and sustainability in code generation research.

Abstract

Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, its efficiency, a vital aspect that plays a pivotal role in green computing and sustainability efforts, has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4 generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4 generated code are 13.89 and 43.92 times those of the canonical solutions, respectively. The source code of EffiBench is released at https://github.com/huangd1999/EffiBench. We also provide the leaderboard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
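
The "times that of the canonical solution" figures above are ratios of measurements. The following is a minimal sketch of how such a ratio could be computed, not the EffiBench implementation. It assumes that a normalized metric is the mean per-problem ratio of the generated-code measurement to the canonical one; the names `measure` and `normalized_metric` are hypothetical, and Python's `tracemalloc` stands in for whatever memory profiler the benchmark actually uses.

```python
import time
import tracemalloc
from statistics import mean
from typing import Callable, Sequence, Tuple


def measure(func: Callable[[], object]) -> Tuple[float, int]:
    """Return (wall-clock seconds, peak traced bytes) for one call of func.

    Hypothetical helper: tracing allocations slows execution, so a real
    harness would likely time and profile memory in separate runs.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        func()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak


def normalized_metric(generated: Sequence[float],
                      canonical: Sequence[float]) -> float:
    """Mean per-problem ratio generated / canonical (assumed definition)."""
    if len(generated) != len(canonical) or not generated:
        raise ValueError("need equally sized, non-empty measurement lists")
    return mean(g / c for g, c in zip(generated, canonical))


# Two problems: generated code takes 2.0 s and 6.0 s, canonical 1.0 s and 2.0 s.
print(normalized_metric([2.0, 6.0], [1.0, 2.0]))  # 2.5
```

Under this assumed definition, a value of 2.5 means the generated code is on average 2.5 times as slow as the canonical solutions; the same function applies unchanged to memory measurements.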

Paper Structure (43 sections, 12 equations, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Example code with different time complexities generated by Copilot and GPT-4, respectively. Code accessed on January 15, 2024.
  • Figure 2: A case illustration comparing code generated by GPT-3.5-turbo-0301 with the canonical_solution. The GPT-3.5-turbo-0301 generated code uses 70.62x the memory of the canonical_solution. It employs a 2-dimensional matrix to manage state transitions, which incurs substantial memory overhead, especially when the parameters $n$ and $k$ are large. In contrast, the canonical_solution uses a rolling-sum technique and a single one-dimensional array, reducing the space complexity from $O(n \times k)$ to $O(k)$.
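
The space reduction described in Figure 2 can be sketched on a concrete dynamic program. The figure caption does not name the problem, so the example below assumes an illustrative one with the same shape: counting permutations of 1..n with exactly k inverse pairs. Neither function is the code shown in the paper's figure; `count_2d` and `count_rolling` are hypothetical names.

```python
MOD = 10**9 + 7


def count_2d(n: int, k: int) -> int:
    """O(n * k) space: keep the full table of states."""
    # dp[i][j] = permutations of 1..i with exactly j inverse pairs
    dp = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 1
    for i in range(1, n + 1):
        for j in range(k + 1):
            # Placing element i creates t new inverse pairs, 0 <= t <= i - 1.
            for t in range(min(j, i - 1) + 1):
                dp[i][j] = (dp[i][j] + dp[i - 1][j - t]) % MOD
    return dp[n][k]


def count_rolling(n: int, k: int) -> int:
    """O(k) space: keep only the previous row plus a rolling window sum."""
    dp = [1] + [0] * k  # row for i = 0
    for i in range(1, n + 1):
        new = [0] * (k + 1)
        window = 0  # sum of dp[j - i + 1 .. j], clipped at index 0
        for j in range(k + 1):
            window = (window + dp[j]) % MOD
            if j >= i:
                window = (window - dp[j - i]) % MOD
            new[j] = window
        dp = new
    return dp[k]


print(count_2d(3, 1), count_rolling(3, 1))  # 2 2
```

Both functions return the same counts, but the rolling version never holds more than two rows of length k + 1, and the window sum also removes the innermost loop of the 2-dimensional version.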