Mercury: A Code Efficiency Benchmark for Code Large Language Models

Mingzhe Du; Anh Tuan Luu; Bin Ji; Qian Liu; See-Kiong Ng

TL;DR

Mercury fills a critical gap in NL2Code evaluation by introducing a code efficiency benchmark and a distribution-aware metric, Beyond, which normalizes code runtimes across diverse tasks. The dataset comprises 1,889 LeetCode-derived Python tasks with per-task test-case generators and historical solutions, enabling sandboxed, hardware-agnostic assessment. Empirical results show that Code LLMs remain strong at functional correctness, but efficiency improvements are challenging; Direct Preference Optimization (DPO) consistently boosts Beyond for larger models, while Supervised Fine-Tuning (SFT) often degrades efficiency or correctness. The authors release the Mercury dataset and framework to support ongoing research into more efficient code generation.

Abstract

Amidst the recent strides in evaluating Large Language Models for Code (Code LLMs), existing benchmarks have mainly focused on the functional correctness of generated code, neglecting the importance of its computational efficiency. To fill the gap, we present Mercury, the first code efficiency benchmark for Code LLMs. It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines, enabling a comprehensive analysis of the runtime distribution. Based on the distribution, we introduce a new metric, Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously. On Mercury, leading Code LLMs achieve 65% on Pass but less than 50% on Beyond. Given that an ideal Beyond score would match the Pass score, this indicates that while Code LLMs exhibit impressive capabilities in generating functionally correct code, there remains a notable gap in their efficiency. Finally, our empirical experiments reveal that Direct Preference Optimization (DPO) serves as a robust baseline for enhancing code efficiency compared with Supervised Fine-Tuning (SFT), which paves a promising avenue for future exploration of efficient code generation. Our code and data are available on GitHub: https://github.com/Elfsong/Mercury.
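To make the idea of a runtime-percentile-weighted Pass score concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a failing solution scores 0 and that a passing solution scores the fraction of historical solutions it outruns; the function names `beyond_for_task` and `beyond_score` are hypothetical, and the exact normalization used in the paper may differ.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def beyond_for_task(runtime_ms: Optional[float],
                    baseline_ms: Sequence[float]) -> float:
    """Score one task (assumed scheme, not taken from the paper).

    runtime_ms is None when the generated code failed any test case.
    Otherwise the score is the fraction of baseline runtimes that are
    strictly slower than the generated code.
    """
    if runtime_ms is None or not baseline_ms:
        return 0.0
    ordered = sorted(baseline_ms)
    # bisect_right counts baselines that are as fast or faster.
    slower = len(ordered) - bisect_right(ordered, runtime_ms)
    return slower / len(ordered)


def beyond_score(results: Sequence[Tuple[Optional[float], Sequence[float]]]) -> float:
    """Average the per-task scores over all tasks."""
    if not results:
        return 0.0
    return sum(beyond_for_task(r, b) for r, b in results) / len(results)
```

Under this scheme a failed task contributes 0 and a passed task contributes at most 1, so the aggregate can never exceed the Pass rate, which is why the two scores coincide only when every correct solution is also faster than all of its baselines.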

Paper Structure (35 sections, 3 equations, 11 figures, 7 tables)

Introduction
Mercury Datasets
Code Efficiency Metric
Experiments
Baselines
Functional Correctness Benchmarks
Experimental Setups
Empirical Results
Failure Analysis
Related Work
Limitations
Conclusion
Appendix
Dataset Nutrition Labels
Mercury Data Distribution and Customized Data Structures
...and 20 more sections

Figures (11)

Figure 1: Two LLM-generated code snippets executed on 100 test cases. Both follow the task instruction and pass all test cases, but the right snippet is far more efficient, completing in 121 ms compared with 5,714 ms for the left snippet. As Code LLMs become widely used in the real world, code efficiency determines actual productivity, which Mercury is designed to measure.
Figure 2: An overview of the Mercury dataset. Each Mercury task has a task description, a test case generator, a prompt and entry point, and corresponding solutions. To evaluate code efficiency, we introduce the Beyond metric, which gives the runtime percentile of the LLM-generated code on the runtime distribution formed by the corresponding solutions. In this example, the LLM-generated code executes in 521 ms, outpacing 86.18% of the collected solutions on the runtime distribution. Consequently, the Beyond score in this case is 86.18%.
Figure 3: The horizontal axis represents the score for functional correctness, while the vertical axis indicates the score for code efficiency. The left figure illustrates the performance of the baseline model, whereas the right one depicts the performance after DPO tuning. Model points located nearer to the diagonal line exhibit a more equitable balance between functionality and efficiency.
Figure 4: Mercury supports two customized data structures: TreeNode and ListNode.
Figure 5: Sandbox Execution Pipeline. 1) Test Case Generation. We first employ the corresponding test case generator for each task to produce a comprehensive set of test cases for the subsequent evaluation. 2) Context Initialization. To prevent any unexpected code behavior, the sandbox environment is reinitialized for each new task. This phase ensures that all the common libraries required for executing the solution are loaded. 3) Solution Instantiation. The solution under evaluation is encapsulated as a solution class. 4) Test Case Evaluation. Each test case the generator provides is executed against the solution. A solution must pass all the test cases to be deemed valid. 5) Clean Up. In the final stage, the sandbox clears the namespace environment and the temporary directory. Mercury records the time consumed during the Solution Instantiation and Test Case Evaluation stages as the primary metric for assessing code efficiency.
...and 6 more figures
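
The five stages in the Figure 5 caption can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not Mercury's actual framework: `run_in_sandbox`, the zero-argument `generate_case` callable, and the `Solution` class name are hypothetical, and a plain `exec` into a fresh dictionary does not provide real isolation the way a proper sandbox would.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# A test case is (positional arguments, expected output); assumed layout.
TestCase = Tuple[tuple, Any]


def run_in_sandbox(solution_src: str,
                   entry_point: str,
                   generate_case: Callable[[], TestCase],
                   n_cases: int = 100) -> Optional[float]:
    """Return elapsed milliseconds if every test passes, else None."""
    # 1) Test case generation.
    cases = [generate_case() for _ in range(n_cases)]
    # 2) Context initialization: a fresh namespace for each task.
    namespace: Dict[str, Any] = {"__name__": "__sandbox__"}
    try:
        start = time.perf_counter()
        # 3) Solution instantiation (timed, as the caption describes).
        exec(solution_src, namespace)
        method = getattr(namespace["Solution"](), entry_point)
        # 4) Test case evaluation: every case must pass.
        for args, expected in cases:
            if method(*args) != expected:
                return None
        return (time.perf_counter() - start) * 1000.0
    except Exception:
        # Any compile or runtime error counts as a failed solution.
        return None
    finally:
        # 5) Clean up the namespace.
        namespace.clear()
```

The timer deliberately spans stages 3 and 4 only, matching the caption's statement that instantiation and test case evaluation are what Mercury measures; test case generation and clean-up fall outside the timed region.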
