Project MPG: towards a generalized performance benchmark for LLM capabilities

Lucas Spangher; Tianle Li; William F. Arnold; Nick Masiewicki; Xerxes Dotiwalla; Rama Parusmathi; Peter Grabowski; Eugene Ie; Dan Gruhl

TL;DR

A method for aggregating performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness). The name also references a metric widely understood to be an important yet inaccurate and crude measure of car performance.

Abstract

A very wide array of LLM benchmarking tasks exists, yet a single number is often the most actionable input for decision-making, especially for non-experts. No aggregation scheme currently exists that is not Elo-based, and Elo-based rankings can be costly or time-consuming to produce. We propose a method for aggregating performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness), a name that also references a metric widely understood to be an important yet inaccurate and crude measure of car performance. We produce two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or queries per second, QPS). We compare models against each other and present rankings according to our general metric as well as by subdomain. We find significant agreement, measured by raw Pearson correlation, between our scores and those of Chatbot Arena, even improving on the correlation between the MMLU leaderboard and Chatbot Arena.
