Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Melissa Ailem; Katerina Marazopoulou; Charlotte Siska; James Bono

TL;DR

This work investigates how benchmark prompts for evaluating large language models may be biased by non-random correlations among prompts, which can alter model rankings when different prompt distributions are assumed. It introduces a framework combining performance matrices, permutation and KS tests, clustering, and semantic embeddings to diagnose prompt correlations and assess the robustness of benchmark-based comparisons. Empirical results show significant prompt-level correlations across major benchmarks, with ranking shifts up to five positions and performance changes up to about ten percent under alternative weighting schemes; semantic similarity explains some of the observed patterns, especially in task types where failure modes dominate. The findings provide a diagnostic tool for assessing benchmark robustness and guidance for designing benchmarks less susceptible to distributional biases in LLM evaluations.

Abstract

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
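To make finding (2) concrete, the following is a minimal sketch of how a change in prompt weights can flip a ranking. It is not the paper's implementation: the two models, their per-prompt scores, the task labels, and the rule "each task cluster receives equal total weight" are all assumptions chosen for illustration. The 7-of-12 math split mirrors the Figure 1 caption; the 3/2 split of the remaining prompts is made up.

```python
from collections import Counter

# Hypothetical task label for each of 12 prompts (7 math, 3 code, 2 text).
tasks = ["math"] * 7 + ["code"] * 3 + ["text"] * 2

# Hypothetical per-prompt correctness (1 = correct) for two made-up models.
performance = {
    "model_a": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # strong on math only
    "model_b": [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],  # stronger on code and text
}


def uniform_weights(n):
    """Standard benchmark average: every prompt counts equally."""
    return [1.0 / n] * n


def cluster_balanced_weights(labels):
    """Assumed scheme: each cluster gets equal total weight, split evenly inside it."""
    counts = Counter(labels)
    n_clusters = len(counts)
    return [1.0 / (n_clusters * counts[label]) for label in labels]


def weighted_score(scores, weights):
    """Weighted aggregate performance of one model."""
    return sum(s * w for s, w in zip(scores, weights))


def rank_models(perf, weights):
    """Model names sorted from best to worst weighted score."""
    return sorted(perf, key=lambda m: weighted_score(perf[m], weights), reverse=True)


uniform_ranking = rank_models(performance, uniform_weights(len(tasks)))
balanced_ranking = rank_models(performance, cluster_balanced_weights(tasks))
```

With these made-up numbers, `model_a` leads under uniform weights (7/12 versus 6/12), while `model_b` leads once the math cluster no longer dominates the average. The only point is that the ranking depends on the assumed prompt distribution.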

Paper Structure (32 sections, 3 equations, 12 figures, 4 tables)

Introduction
Related work
Proposed method
Problem setup
Determining if performance vectors are correlated
Effect of non-uniform weights in aggregate performance metrics
Cluster-based
Increasing benchmark size
Random distributions of weights
Comparing performance vectors with semantic embeddings of prompts
Experimental setup
Benchmarks
ANLI
HellaSwag
CommonsenseQA
...and 17 more sections

Figures (12)

Figure 1: Illustrative example showcasing how different distributional assumptions of benchmarks affect model rankings. Consider a benchmark containing prompts reflecting three different tasks: math (red triangles), code generation (blue circles), and text generation (green squares). In the unweighted panel, each benchmark prompt contributes equally to the model evaluation. In contrast, the weighted panel accounts for correlations between prompts, and the weights of the prompts are adjusted accordingly during evaluation. In the unweighted scenario, the red LLM ranks highest because it excels in math, and the benchmark is biased towards math tasks (7 out of 12 prompts are math-related). When considering different weights in the weighted scenario, we observe a different ranking outcome.
Figure 2: Visualization of ranking changes (compared to original benchmark) for various benchmark modifications. Rows show different weighting methods, columns show the models. Each cell contains the ranking change (original ranking minus new ranking) of the column-model for the row-method. We observe rank changes as great as 5.
Figure 3: Average performance as benchmark size increases. Prompts are added to maximize average cosine distance. Maximum benchmark size corresponds to performance on the original benchmark.
Figure 4: Pairwise comparison of weighted performance. Each cell is the percentage of times the model of the row outperforms the model of the column.
Figure 5: Distribution of semantic similarity coefficients and FDRs for all benchmarks. Red is original data, blue is permutations. KS tests for all distributions shown have p-values < 2e-5.
...and 7 more figures
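
The pairwise comparison described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the weight distribution is not specified in this excerpt, so the sketch assumes weights drawn uniformly from the probability simplex (normalized exponential draws), and the function names `random_simplex_weights` and `pairwise_win_percentages` are hypothetical.

```python
import random


def random_simplex_weights(n, rng):
    """Draw n non-negative weights summing to 1 (assumed: uniform on the simplex)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def pairwise_win_percentages(perf, n_draws=1000, seed=0):
    """For each ordered pair (row, column), the percentage of weight draws
    in which the row model's weighted score strictly exceeds the column model's."""
    rng = random.Random(seed)
    models = list(perf)
    n_prompts = len(next(iter(perf.values())))
    wins = {(a, b): 0 for a in models for b in models if a != b}
    for _ in range(n_draws):
        weights = random_simplex_weights(n_prompts, rng)
        scores = {
            m: sum(s * w for s, w in zip(perf[m], weights)) for m in models
        }
        for a, b in wins:
            if scores[a] > scores[b]:
                wins[(a, b)] += 1
    return {pair: 100.0 * count / n_draws for pair, count in wins.items()}


# Hypothetical per-prompt scores for three made-up models on four prompts.
toy_perf = {
    "strong": [1, 1, 1, 1],
    "mixed": [1, 0, 1, 0],
    "weak": [0, 0, 0, 0],
}
table = pairwise_win_percentages(toy_perf, n_draws=200, seed=1)
```

A cell close to 50 suggests that the ordering of the two models is sensitive to the assumed prompt weights, while a cell close to 0 or 100 suggests the ordering is stable across the sampled weightings.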
