Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?

Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang

TL;DR

The paper tackles the rising cost of evaluating large language models by identifying pervasive redundancy in benchmark samples. It introduces EssenceBench, a coarse-to-fine benchmark compression framework that combines redundancy-aware filtering with an iterative genetic-algorithm search. The search is guided by a generalized additive model (GAM), and per-sample attribution from an Explainable Boosting Machine is used to reconstruct full benchmark scores from a lean subset. Empirical results on multiple standard benchmarks show that EssenceBench reduces the data needed by up to approximately 200× while preserving ranking fidelity, often matching or surpassing prior methods with far fewer examples. This makes LLM evaluation faster, cheaper, and more scalable with little loss of reliability, which matters for both benchmark design and model comparison.

Abstract

As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that integrates these methods to ensure both prediction accuracy and ranking consistency is still largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify highly similar samples that can be eliminated. We also frame benchmark compression as an optimization problem whose objective is score reconstruction. Building on these, we propose EssenceBench, a coarse-to-fine framework built on an iterative Genetic Algorithm (GA) that takes advantage of both fitness-based subset search and attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method keeps the ranking shift of all models within 5% using 25x fewer samples, and achieves 95% ranking preservation within a 5% shift using 200x fewer samples.
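The ranking claim in the abstract can be made concrete with a small sketch. It assumes one reading of "shifting within 5%": a model's rank on the subset differs from its rank on the full benchmark by at most 5% of the number of models. The function names, the `tolerance` parameter, and the use of plain subset accuracy as the score estimate are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def rank_positions(scores):
    """Return each model's rank (0 = best) from its accuracy."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def fraction_within_shift(score_matrix, subset_idx, tolerance=0.05):
    """Fraction of models whose rank moves by at most `tolerance`
    (relative to the number of models) when scored on the subset.

    score_matrix: (num_models, num_samples) binary correctness matrix.
    subset_idx:   indices of the retained samples.
    """
    full_acc = score_matrix.mean(axis=1)
    sub_acc = score_matrix[:, subset_idx].mean(axis=1)
    shift = np.abs(rank_positions(full_acc) - rank_positions(sub_acc))
    relative_shift = shift / len(full_acc)
    return float((relative_shift <= tolerance).mean())
```

Under this reading, the 25x result corresponds to the function returning 1.0, and the 200x result to a value of at least 0.95.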

Paper Structure

This paper contains 24 sections, 15 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Prevalent redundancy across widely used benchmark datasets. Based on 10 randomly sampled instances per dataset, panel (a) depicts the text embedding similarity (Definition 3), reflecting semantic overlap among instances, and panel (b) presents the ranking embedding similarity (Definition 4), measured through consistency of model performance rankings across sampled subsets.
  • Figure 2: Comparison of existing benchmark compression approaches and our EssenceBench: (a) ranking redundancy, (b) text redundancy, and (c) compression time.
  • Figure 3: The pipeline of EssenceBench. (I) Coarse Filtering. The binary score matrix is extracted for each benchmark, text-level and ranking-level redundancies are computed, and samples that exceed the thresholds are removed. (II) Subset Selection. A Genetic Algorithm (GA) is applied, starting from randomly generated subsets. With fitness measured by the error of the predicted accuracy, subsets are optimized via fitness-based tournament selection, crossover, mutation, and adjustment. (III) Sample Selection. The attribution of each sample is estimated from the top-performing subsets, using the weights learned when training a model. Samples are then divided into groups according to their attribution, and the GA is reapplied within each group to identify the most representative and informative subset.
  • Figure 4: Ablation results on GSM8K, evaluating the effect of (a) coarse filtering, (b) attribution-based selection, and (c) grouping strategies.
  • Figure 5: Comparison of ranking change distributions between MetaBench and EssenceBench on the HellaSwag dataset, where $k$ denotes the subset size.
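The subset-selection stage described in the Figure 3 caption (random initial subsets, tournament selection, crossover, mutation, adjustment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fitness here is the negative RMSE between plain subset accuracy and full-benchmark accuracy, standing in for the paper's learned predictor, and all names and hyperparameters (`ga_select`, `pop_size`, `mut_rate`, tournament size 3) are hypothetical.

```python
import numpy as np

def fitness(score_matrix, subset):
    """Negative RMSE between subset accuracy and full accuracy.
    Assumption: subset mean stands in for a learned score predictor."""
    full = score_matrix.mean(axis=1)
    pred = score_matrix[:, subset].mean(axis=1)
    return -float(np.sqrt(np.mean((pred - full) ** 2)))

def adjust(mask, k, rng):
    """Repair a boolean mask in place so exactly k samples are selected."""
    on = np.flatnonzero(mask)
    off = np.flatnonzero(~mask)
    if len(on) > k:
        mask[rng.choice(on, len(on) - k, replace=False)] = False
    elif len(on) < k:
        mask[rng.choice(off, k - len(on), replace=False)] = True
    return mask

def ga_select(score_matrix, k, pop_size=30, generations=40,
              mut_rate=0.01, seed=0):
    """Search for k sample indices whose accuracy tracks the full benchmark."""
    rng = np.random.default_rng(seed)
    n = score_matrix.shape[1]

    def fit(mask):
        return fitness(score_matrix, np.flatnonzero(mask))

    # Random initial population of size-k subsets.
    pop = [adjust(np.zeros(n, dtype=bool), k, rng) for _ in range(pop_size)]
    best = max(pop, key=fit).copy()
    for _gen in range(generations):
        scores = [fit(m) for m in pop]
        children = []
        for _child in range(pop_size):
            parents = []
            for _p in range(2):
                # Tournament selection: fittest of 3 random individuals.
                cand = rng.choice(pop_size, 3, replace=False)
                parents.append(pop[max(cand, key=lambda i: scores[i])])
            # Uniform crossover, bit-flip mutation, then size repair.
            take_first = rng.random(n) < 0.5
            child = np.where(take_first, parents[0], parents[1])
            flip = rng.random(n) < mut_rate
            children.append(adjust(child ^ flip, k, rng))
        pop = children
        gen_best = max(pop, key=fit)
        if fit(gen_best) > fit(best):
            best = gen_best.copy()
    return np.flatnonzero(best)
```

The `adjust` step is what keeps every candidate at exactly `k` samples after crossover and mutation, so fitness values stay comparable across the population.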

Theorems & Definitions (4)

  • Definition 1: Benchmark Compression
  • Definition 2: Concrete Formulation of Benchmark Compression
  • Definition 3: Sample Redundancy from Text Perspective
  • Definition 4: Sample Redundancy from Ranking Perspective
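The two redundancy notions listed above (text perspective and ranking perspective) suggest a simple coarse filter, sketched below. The exact definitions are not reproduced on this page, so the sketch makes its own assumptions: text redundancy is cosine similarity between sample embeddings, ranking redundancy is the Pearson correlation between two samples' per-model correctness columns, and a sample is dropped if either similarity to an already-kept sample reaches its threshold. The function names and the 0.95 thresholds are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def coarse_filter(text_emb, score_matrix, text_thr=0.95, rank_thr=0.95):
    """Greedily keep samples that are not redundant with any kept sample.

    text_emb:     (num_samples, dim) text embeddings (assumed precomputed).
    score_matrix: (num_models, num_samples) binary correctness matrix.
    """
    text_sim = cosine_similarity_matrix(text_emb)
    # Center each sample's per-model column so cosine equals Pearson
    # correlation; constant columns become zero vectors (similarity 0).
    cols = score_matrix.T.astype(float)
    cols = cols - cols.mean(axis=1, keepdims=True)
    rank_sim = cosine_similarity_matrix(cols)

    kept = []
    for i in range(text_emb.shape[0]):
        redundant = any(
            text_sim[i, j] >= text_thr or rank_sim[i, j] >= rank_thr
            for j in kept
        )
        if not redundant:
            kept.append(i)
    return np.array(kept, dtype=int)
```

The pairwise matrices cost memory quadratic in the number of samples, so for a benchmark the size of HellaSwag a blocked or approximate nearest-neighbor variant would likely be needed.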