Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks

Marco AF Pimentel, Clément Christophe, Tathagata Raha, Prateek Munjal, Praveen K Kanithi, Shadab Khan

TL;DR

This study interrogates variability in large language model evaluation by dissecting metric calculation methods used by prominent MCQ-focused frameworks (OpenCompass, Eval Harness, HELM) across four datasets. It formalizes MCQ evaluation with $Q$, $A_i$, $c_i$, $\hat{c}_i$, and $P(q_{m+1}|q_{0:m})$, comparing token-probability and text-generation approaches. The authors find substantial cross-framework variability (5–26% within datasets) and inconsistent normalization effects, underscoring how methodology shapes reported performance beyond model quality. They argue for rigorous, transparent reporting of evaluation procedures to enable reproducibility and fair cross-model comparisons in LLM benchmarking.
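To make the normalization effect concrete, here is a minimal Python sketch of how a token-probability approach might select an answer with and without length normalization. It is an illustration under stated assumptions, not the procedure of any of the frameworks named above: the function `pick_option`, the example options, and the log-likelihood values are all hypothetical, and normalizing by UTF-8 byte length is only one common choice, which may differ from the normalization the paper studies.

```python
from typing import Sequence


def pick_option(
    log_likelihoods: Sequence[float],
    options: Sequence[str],
    normalize: bool = False,
) -> int:
    """Return the index of the highest-scoring answer option.

    log_likelihoods[i] is assumed to be the summed log-probability a model
    assigns to options[i] given the question (computed elsewhere).
    With normalize=True, each score is divided by the option's UTF-8 byte
    length (an assumed normalization scheme, used here for illustration).
    """
    scores = []
    for ll, text in zip(log_likelihoods, options):
        if normalize:
            # Guard against empty options to avoid division by zero.
            scores.append(ll / max(1, len(text.encode("utf-8"))))
        else:
            scores.append(ll)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical values: the longer option has a lower total log-likelihood
# but a higher per-byte log-likelihood.
options = ["Yes", "Yes, because the evidence clearly supports it"]
log_likelihoods = [-4.0, -9.0]

raw_choice = pick_option(log_likelihoods, options)
norm_choice = pick_option(log_likelihoods, options, normalize=True)
print(raw_choice, norm_choice)
```

With these made-up numbers the raw score favors the short option, while the normalized score favors the long one, so the same model output can be counted as correct or incorrect depending on the scoring rule alone.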

Abstract

As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of various linguistic tasks, model architectures, and benchmarking methodologies. In recent years, various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks for assessing the capabilities of LLMs across diverse domains. This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Performance disparities ($\delta$) of Llama2 (7B, 13B, and 70B) and Mistral-7B on various benchmark datasets (0-shot setting). The $\delta$ values, shown in lighter colors, represent the variation in accuracy on each benchmark that arises from using different evaluation frameworks.
  • Figure 2: Bland-Altman plots (left) and frequency plots or histograms (right) of the length difference between correct and wrong options for the HellaSwag benchmark dataset. The length difference for the entire dataset is shown in black (in both top and bottom panels). In the top panel, the length differences for the instances in which Mistral-7B selected a wrong option under the unnormalized likelihood method (raw-based accuracy) are overlaid in red. In the bottom panel, the length differences for the instances in which Mistral-7B selected a wrong option under the B-norm accuracy method are overlaid in red.
  • Figure 3: Bland-Altman plots (left) and frequency plots or histograms (right) of the length difference between correct and wrong options for the MedQA benchmark dataset. The length difference for the entire dataset is shown in black (in both top and bottom panels). In the top panel, the length differences for the instances in which Llama2-70B selected a wrong option under the unnormalized likelihood method (raw-based accuracy) are overlaid in red. In the bottom panel, the length differences for the instances in which Llama2-70B selected a wrong option under the B-norm accuracy method are overlaid in red.