How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs

Jialun Cao; Yuk-Kit Chan; Zixuan Ling; Wenxuan Wang; Shuqing Li; Mingwei Liu; Ruixi Qiao; Yuting Han; Chaozheng Wang; Boxi Yu; Pinjia He; Shuai Wang; Zibin Zheng; Michael R. Lyu; Shing-Chi Cheung

TL;DR

The paper tackles the lack of standardized guidelines for code-related benchmarks used to evaluate LLMs. It introduces How2Bench, a 55-criteria, lifecycle-oriented guideline, and applies it to profile 274 benchmarks to identify widespread data quality, reproducibility, and transparency problems. A human study with 49 researchers assesses practicality and reveals knowledge gaps, underscoring the need for systematic benchmark governance. The work provides actionable guidelines, a printable appendix, and evidence that adopting these standards could enhance the reliability and comparability of future benchmarks.

Abstract

Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which comprises a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures for data quality assurance; over 10% were not open-sourced or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference code, tests, or prompts, and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
