Evaluating Language Models for Efficient Code Generation

Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang

TL;DR

The paper presents Differential Performance Evaluation (DPE), a framework to reliably evaluate the efficiency of code generated by Large Language Models. It introduces Synthesizing a Synthesizer (SaS) for generating performance-exercising inputs, adaptive clustering to define efficiency levels, and the Differential Performance Score (DPS) to rank solutions against references. EvalPerf, a 121-task benchmark built under DPE, shows that code efficiency does not strictly follow scaling laws and that instruction tuning can improve both correctness and efficiency, with results that remain reliable across platforms. Overall, DPE offers a principled, scalable approach to assessing and comparing code-generation efficiency across diverse platforms and models.

Abstract

We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code efficiency, due to their reliance on simplistic test inputs and the absence of effective compound metrics. DPE addresses these issues by focusing on efficiency-demanding programming tasks and establishing an insightful compound metric for performance evaluation. DPE operates in two phases: To curate efficiency datasets, it selects efficiency-demanding tasks from existing coding benchmarks and generates computationally expensive inputs to stress the efficiency of LLM solutions. To assess the code efficiency, DPE profiles the new solution and compares it globally against a set of reference solutions that exhibit distinct efficiency levels, where the matched level defines its efficiency score. As a proof of concept, we use DPE to create EvalPerf, a benchmark with 121 performance-challenging coding tasks. Our comprehensive evaluation draws interesting findings on the efficiency impact of model sizes, instruction tuning, and prompting. For example, while the scaling law fails to account for code efficiency, general instruction tuning benefits both code correctness and efficiency. We also evaluate the evaluation by examining the effectiveness of DPE, showing that EvalPerf is reliable and convenient to use even across platforms.
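
To make the scoring step concrete, here is a minimal Python sketch of matching a new solution against reference efficiency levels. It is an illustration under stated assumptions, not the authors' implementation: the `Level` type, the `differential_score` function, the use of a single scalar cost per level, and the matching rule (a solution matches a level when its cost does not exceed that level's representative cost) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One reference efficiency level (hypothetical structure)."""
    cost: float  # representative profiled cost of the level; lower is faster
    size: int    # number of sampled solutions grouped into this level


def differential_score(new_cost: float, levels: list[Level]) -> float:
    """Percentage of sampled solutions that the new solution matches or beats.

    Assumed rule: walk the levels from slowest to fastest and keep
    accumulating their sizes while the new solution is at least as fast as
    the level's representative cost.
    """
    if not levels:
        raise ValueError("at least one reference level is required")

    ordered = sorted(levels, key=lambda lv: lv.cost, reverse=True)  # slowest first
    total = sum(lv.size for lv in ordered)

    covered = 0
    for lv in ordered:
        if new_cost > lv.cost:
            break  # the new solution is slower than this level's reference
        covered += lv.size
    return 100.0 * covered / total


# Three made-up levels covering 10 sampled solutions in total.
reference_levels = [Level(cost=1000.0, size=2), Level(cost=100.0, size=5), Level(cost=10.0, size=3)]
score = differential_score(50.0, reference_levels)
```

In this made-up example, a solution with cost 50 is at least as fast as the two slower levels (7 of 10 sampled solutions) but not the fastest one, so its score is 70. A solution slower than every reference scores 0, and one that matches the fastest level scores 100.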

Paper Structure (17 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Overview of Differential Performance Evaluation
  • Figure 2: The algorithm to adaptively segment solutions for each task based on their efficiency.
  • Figure 3: Pairwise comparison of DPS with model variant pairs. Each pair of variants is compared over the common set of passing solutions. Within each block, the bottom-left number comes from the corresponding variant in the y-axis and the top-right number is for the x-axis.
  • Figure 4: Distribution of runtime variation over the runtime scale.
  • Figure 5: DPS on EvalPerf vs. pass@1 on HumanEval+ and MBPP+.
  • ...and 3 more figures