Project MPG: towards a generalized performance benchmark for LLM capabilities

Lucas Spangher, Tianle Li, William F. Arnold, Nick Masiewicki, Xerxes Dotiwalla, Rama Parusmathi, Peter Grabowski, Eugene Ie, Dan Gruhl

TL;DR

A method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness). The name also references a metric widely understood to be an important yet inaccurate and crude measure of car performance.

Abstract

There exists an extremely wide array of LLM benchmarking tasks, yet a single number is often the most actionable for decision-making, especially by non-experts. No aggregation scheme exists that is not Elo-based, and Elo-based schemes can be costly or time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness); the name also references a metric widely understood to be an important yet inaccurate and crude measure of car performance. We produce two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or QPS). We compare models against each other and present a ranking according to our general metric as well as by subdomain. We find significant agreement, measured by raw Pearson correlation, between our scores and those of Chatbot Arena; this correlation even exceeds that between the MMLU leaderboard and Chatbot Arena.
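As a rough illustration of the comparison described in the abstract, the sketch below collapses per-benchmark accuracies into one score per model and computes the raw Pearson correlation against a reference rating. It is not the paper's aggregation method: the unweighted mean, the model and benchmark names, and every number are hypothetical placeholders chosen only to make the example run.

```python
from math import sqrt
from statistics import mean


def aggregate(scores_by_benchmark):
    """Collapse per-benchmark accuracies into one number.

    Assumption: a plain unweighted mean, used here only as a stand-in
    for whatever aggregation the paper actually defines.
    """
    return mean(scores_by_benchmark.values())


def pearson(xs, ys):
    """Raw Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-benchmark accuracies (not taken from the paper).
benchmark_scores = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.70},
    "model_b": {"bench_1": 0.60, "bench_2": 0.50},
    "model_c": {"bench_1": 0.40, "bench_2": 0.45},
}
# Hypothetical reference ratings, standing in for an Elo-style leaderboard.
reference_ratings = {"model_a": 1200.0, "model_b": 1100.0, "model_c": 1050.0}

models = sorted(benchmark_scores)
goodness = [aggregate(benchmark_scores[m]) for m in models]
reference = [reference_ratings[m] for m in models]
r = pearson(goodness, reference)
print(f"Pearson r between aggregate score and reference rating: {r:.3f}")
```

A value of r close to 1 would indicate that the aggregate score tracks the reference ratings closely. With only a handful of models, such a correlation carries wide uncertainty.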

Paper Structure

This paper contains 13 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Outcome of our MPG benchmark applied to thirteen publicly facing language models. The x axis is the "Performance" (Queries Per Second), shown on a log scale, and the y axis is "Goodness" (our benchmark's outcome). Error bars show 95% confidence intervals, described in Section \ref{sec:score_agg}.
  • Figure 2: Hierarchical structure of MPG metrics. Note that each of the six leaf nodes under "Factual Knowledge" and "Social Sensitivity" is treated as an equal leaf node; we drew fewer arrows only to simplify the figure.
  • Figure 3: Taxonomy of subject groupings for the benchmark.
  • Figure 4: Orderings of the LLMs we studied.
  • Figure 5: Raw score correlation between MPG and LMSys Chatbot Arena scores. We find a significant correlation between the two.
  • ...and 1 more figures