Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

Yotam Perlitz; Ariel Gera; Ofir Arviv; Asaf Yehudai; Elron Bandel; Eyal Shnarch; Michal Shmueli-Scheuer; Leshem Choshen

TL;DR

This paper introduces BenchBench, a Python package for Benchmark Agreement Testing (BAT), and the BenchBench-leaderboard, a meta-benchmark that evaluates benchmarks using their peers, with the aim of making benchmark evaluations more robust and valid as language model research evolves.

Abstract

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and undermining the ability to choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research. BenchBench Package: github.com/IBM/BenchBench Leaderboard: hf.co/spaces/IBM/BenchBench
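
As an illustration of the agreement metric mentioned in the abstract, the following minimal sketch computes a Kendall-tau rank correlation (the tau-a variant, which ignores tie corrections) between a new benchmark and a reference benchmark over the models they share. It does not use the BenchBench package API; the function name `kendall_tau`, the model names, and all scores are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two benchmarks over the models they share.

    scores_a, scores_b: dicts mapping model name -> benchmark score.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both benchmarks order the two models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, for illustration only (not taken from the paper).
new_benchmark = {"model_1": 0.9, "model_2": 0.8, "model_3": 0.7, "model_4": 0.6}
reference = {"model_1": 85, "model_2": 70, "model_3": 75, "model_4": 60}

tau = kendall_tau(new_benchmark, reference)
print(f"Kendall tau: {tau:.3f}")
```

In this toy example five of the six model pairs are ordered the same way by both benchmarks and one pair is swapped, so the score is (5 - 1) / 6. Whether such a value counts as "agreement" depends on the threshold, which is one of the methodological choices the paper examines.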

Paper Structure (25 sections, 8 figures, 1 table)

Introduction
Setup
BAT Methodological Decisions: An Analysis
The Choice of Reference Benchmark Matters
The Choice of Models Matters
The Number of Compared Models Matters
Granularity Matters
The Choice of Correlation Metric (and Threshold) Matters
BAT Best Practices
Use an Aggregate Reference Benchmark
Use a Data-driven Threshold
Use More Models and Sample Them Randomly
Report Multiple Granularities
Follow The Above Rules!
BenchBench - a Package and Leaderboard
...and 10 more sections

Figures (8)

Figure 1: Running BAT using our best practices increases consistency by 3x. The average standard deviation of BAT results over multiple instances is drastically decreased using our best practices, without incurring further computational costs. These best practices can be easily applied using our BenchBench package. Further details in Table \ref{tab:ablations}.
Figure 2: BAT Conclusions depend on the models considered. Kendall-tau correlations between the LMSys Arena benchmark and three other benchmarks: BBH, MMLU, and Alpaca v2. Each group of bars represents the correlation for different sets of top models, specifically the top 5, top 10, and top 15 (overlapping) models (according to the Arena). The results indicate that the degree of agreement between benchmarks varies with the number of top models considered, highlighting that different selections of models can lead to varying conclusions about benchmark agreement.
Figure 3: Agreement scores significantly vary across different appropriate reference benchmarks. Kendall-tau correlations between pairs of benchmarks that are seemingly valid for BAT. Each is taken over 20 models sampled at random.
Figure 4: Agreement is lower for closely ranked models. Mean correlation (y) between each benchmark (lines) and the rest, given different numbers of models. The blue and orange lines are the average of all benchmark pair correlations with models sampled randomly (orange) or in contiguous sets (blue). The shaded lines represent adjacent sampling for the set of benchmarks listed in App. \ref{app:benchmarks_used_in_visualizations}.
Figure 5: Agreement variance is inversely related to model subset size. The mean standard deviation of the Kendall-tau correlations arising from performing BAT using different randomly sampled model subsets. The blue line represents the benchmark mean while the other ones are for the benchmarks listed in App. \ref{app:benchmarks_used_in_visualizations}.
...and 3 more figures
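
The kind of measurement described in the Figure 5 caption (the spread of Kendall-tau values across randomly sampled model subsets of a given size) can be sketched as follows. This is a minimal illustration on synthetic scores, not the authors' code and not the BenchBench API; the helper names, the noise model, the subset sizes, and the number of draws are all assumptions.

```python
import random
import statistics
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists (ties ignored)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def tau_std_by_subset_size(bench_a, bench_b, sizes, n_draws=200, seed=0):
    """Std of Kendall tau over random model subsets, for each subset size."""
    rng = random.Random(seed)
    indices = range(len(bench_a))
    result = {}
    for k in sizes:
        taus = []
        for _ in range(n_draws):
            subset = rng.sample(indices, k)  # k distinct models, drawn at random
            taus.append(kendall_tau([bench_a[i] for i in subset],
                                    [bench_b[i] for i in subset]))
        result[k] = statistics.pstdev(taus)
    return result


# Synthetic data (assumption): 40 models share a latent ability, and each
# benchmark observes it with independent Gaussian noise.
data_rng = random.Random(42)
latent = [data_rng.random() for _ in range(40)]
bench_a = [s + data_rng.gauss(0, 0.1) for s in latent]
bench_b = [s + data_rng.gauss(0, 0.1) for s in latent]

stds = tau_std_by_subset_size(bench_a, bench_b, sizes=[5, 10, 20])
for size, std in stds.items():
    print(f"subset size {size:2d}: std of tau = {std:.3f}")
```

Comparing the printed values across subset sizes gives a rough sense of how much a BAT conclusion could change with the particular models sampled; the figures in the paper report this on real benchmark data.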
