Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
TL;DR
This work investigates how benchmark-based evaluation of large language models may be biased by non-random correlations among prompts, which can alter model rankings when different prompt distributions are assumed. It introduces a framework combining performance matrices, permutation and KS tests, clustering, and semantic embeddings to diagnose prompt correlations and assess the robustness of benchmark-based comparisons. Empirical results show significant prompt-level correlations across major benchmarks, with ranking shifts of up to five positions and performance changes of up to about ten percent under alternative weighting schemes; semantic similarity explains some of the observed patterns, especially in task types where failure modes dominate. The findings provide a diagnostic tool for assessing benchmark robustness and guidance for designing benchmarks that are less susceptible to distributional biases in LLM evaluations.
Abstract
Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate that model. This practice is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
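To make finding (2) concrete, the following is a minimal sketch of how a change in the assumed prompt distribution can reorder models. Everything in it is a toy assumption rather than material from the paper: the performance matrix, the model names, the cluster labels, and the cluster-balanced weighting are hypothetical, and the weighting is only one possible alternative to the uniform average, not necessarily a scheme the authors use.

```python
from collections import Counter

# Hypothetical performance matrix (toy data, not from the paper):
# each row is a model, each column a prompt, 1 = correct, 0 = incorrect.
performance = {
    "model_a": [1, 1, 1, 1, 0, 0],
    "model_b": [1, 0, 0, 0, 1, 1],
    "model_c": [0, 1, 0, 0, 0, 1],
}

# Hypothetical cluster label per prompt: prompts 0-3 are assumed to be
# near-duplicates of each other, while prompts 4 and 5 are distinct.
clusters = [0, 0, 0, 0, 1, 2]


def uniform_weights(n_prompts):
    """Standard benchmark average: every prompt counts equally."""
    return [1.0 / n_prompts] * n_prompts


def cluster_balanced_weights(cluster_labels):
    """Give every cluster the same total weight, split evenly among its prompts."""
    sizes = Counter(cluster_labels)
    n_clusters = len(sizes)
    return [1.0 / (n_clusters * sizes[c]) for c in cluster_labels]


def weighted_scores(perf, weights):
    """Weighted average score per model."""
    return {
        model: sum(w * x for w, x in zip(weights, row))
        for model, row in perf.items()
    }


def ranking(scores):
    """Model names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


uniform_ranking = ranking(
    weighted_scores(performance, uniform_weights(len(clusters)))
)
balanced_ranking = ranking(
    weighted_scores(performance, cluster_balanced_weights(clusters))
)

print("uniform:         ", uniform_ranking)
print("cluster-balanced:", balanced_ranking)
```

In this toy setup, `model_a` leads under the uniform average because it solves the four near-duplicate prompts, but it drops to last place once each cluster of prompts carries equal weight. The example only illustrates the mechanism; the magnitude of such shifts on real benchmarks is an empirical question.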
