Statistical multi-metric evaluation and visualization of LLM system predictive performance

Samuel Ackerman, Eitan Farchi, Orna Raz, Assaf Toledo

TL;DR

The paper presents an open-source framework for statistically rigorous, multi-metric evaluation of LLM-based systems across multiple benchmarks and datasets. It automates test selection for paired and unpaired data, aggregates metrics via standardized scores and Wilson's harmonic mean $p$-value, and provides exploratory and formal visualizations to support robust decision-making. The approach is demonstrated on CrossCodeEval with 15 LLMs across four programming languages. Many of the pairwise differences between systems are statistically significant, and the resulting rankings agree with those from simple averages while being backed by formal statistical evidence. The framework offers a practical, statistically principled alternative to naive leaderboard aggregation, enabling more reliable model selection and component-level decision-making in real-world deployments.

Abstract

The evaluation of generative or discriminative large language model (LLM)-based systems is often a complex multi-dimensional problem. Typically, a set of system configuration alternatives is evaluated on one or more benchmark datasets, each with one or more evaluation metrics, which may differ between datasets. We often want to evaluate, with a statistical measure of significance, whether systems perform differently on a given dataset according to a single metric, in aggregate across metrics on a dataset, or across datasets. Such evaluations can support decision-making, such as deciding whether a particular system component change (e.g., choice of LLM or hyperparameter values) significantly improves performance over the current system configuration, or, more generally, whether the systems in a fixed set of configurations (e.g., a leaderboard list) differ significantly in performance according to the metrics of interest. We present a framework implementation that automatically performs the correct statistical tests, properly aggregates the statistical results across metrics and datasets (a nontrivial task), and can visualize the results. The framework is demonstrated on the multi-lingual code generation benchmark CrossCodeEval, for several state-of-the-art LLMs.
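
To make the aggregation step concrete, here is a minimal sketch of the raw harmonic mean of a set of p-values, the statistic underlying the harmonic mean p-value named in the TL;DR. It is an illustration and not the framework's own code: the function name `harmonic_mean_p_value`, its signature, and the default of equal weights are assumptions made for this example. The raw statistic is also not a calibrated p-value on its own; deciding significance from it requires a further adjustment that is omitted here.

```python
from typing import Optional, Sequence


def harmonic_mean_p_value(
    p_values: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted harmonic mean of p-values (hypothetical helper, not the framework's API).

    Computes sum(w_i) / sum(w_i / p_i). With equal weights this is the
    ordinary harmonic mean of the p-values.
    """
    if len(p_values) == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    if weights is None:
        # Assumption for this sketch: every test gets the same weight.
        weights = [1.0 / len(p_values)] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")
    if any(w < 0.0 for w in weights) or sum(weights) <= 0.0:
        raise ValueError("weights must be non-negative with a positive sum")

    total_weight = sum(weights)
    return total_weight / sum(w / p for w, p in zip(weights, p_values))


if __name__ == "__main__":
    # Made-up p-values, e.g. one per metric for a single pair of systems.
    per_metric_p = [0.01, 0.20, 0.04]
    print(harmonic_mean_p_value(per_metric_p))
```

Because small p-values dominate the sum of reciprocals, the combined value is pulled toward the strongest evidence, whereas an arithmetic mean of the same p-values would be dominated by the largest ones.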

Paper Structure

This paper contains 26 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustration of terms (defined in the paper's subsection on basic units) for paired data.
  • Figure 2: Boxplots of system value and rank distribution for the ES metric on CrossCodeEval's C# dataset. Systems are ordered from left to right in decreasing mean quality (for value distribution) or decreasing median rank.
  • Figure 3: Top: Connected graph using p-values for the ID-precision metric on CrossCodeEval's Python dataset. Bottom: The graph split into system cliques of size at least 2.
  • Figure 4: Heatmap of p-values for all metrics for CrossCodeEval's Python dataset, for four systems.
  • Figure 5: Connected graph using p-values on CrossCodeEval, aggregated across programming-language datasets.
  • ...and 2 more figures