Prompt-to-Leaderboard

Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica

TL;DR

Prompt-to-Leaderboard (P2L) addresses the problem that aggregate LLM evaluations obscure prompt- and user-specific performance. It trains a meta-model that maps a prompt $z$ to a per-prompt leaderboard $\theta^*(z)$ of Bradley-Terry coefficients, enabling prompt-conditioned evaluation, routing, and automated analysis; an extension, Prompt-to-Regression, handles other feedback types. The approach supports efficient aggregation over prompt distributions, both cost-constrained and unconstrained routing, and automatic strength/weakness analysis. On Chatbot Arena and LiveBench, it improves prediction of human preferences, yields better per-prompt routing, and generalizes robustly. Applications include personalized model selection, unsupervised task-specific evaluation, and granular insight into model strengths and weaknesses. Performance scales with dataset and model size, and a router built on P2L reached the #1 spot on the Chatbot Arena leaderboard in January 2025. Altogether, P2L provides a principled, scalable framework for LLM evaluation and deployment decisions that go beyond averaged leaderboards.

Abstract

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available on GitHub at https://github.com/lmarena/p2l.
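
For concreteness, the step from a per-prompt leaderboard to a predicted vote can be written out as below. This is the standard two-outcome Bradley-Terry form and ignores ties; the model indices $a$, $b$ and the numeric gap in the example are illustrative assumptions, not values from the paper.

$$
P(b \text{ beats } a \mid z) = \sigma\big(\theta^*(z)_b - \theta^*(z)_a\big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
$$

For example, if $\theta^*(z)_b - \theta^*(z)_a = 1$ on some prompt $z$, the predicted probability that model $b$ is preferred is $\sigma(1) \approx 0.73$; a gap of $0$ gives exactly $0.5$.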

Paper Structure

This paper contains 22 sections, 1 theorem, 24 equations, 11 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Assume that for every prompt $z$, the Bradley-Terry model holds with coefficients $\theta^*(z)$. Then, the optimization problems in problem:optimal-routing and problem:optimal-routing-bt are both equivalent to the following problem: where $\mathbf{W}^*$ represents the population win matrix, with entries $\mathbf{W}^*_{ba} = \sigma(\theta^*(z)_b - \theta^*(z)_a)$.
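
As a rough illustration of the win matrix defined above, the sketch below builds $\mathbf{W}^*$ from a leaderboard vector and scores each model against an opponent distribution. The coefficient values, the uniform opponent distribution `q`, and the helper names are hypothetical; this does not reproduce the optimization problem stated in the theorem, only the matrix it refers to.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def win_matrix(theta):
    """W[b][a] = sigmoid(theta[b] - theta[a]): probability that b beats a."""
    m = len(theta)
    return [[sigmoid(theta[b] - theta[a]) for a in range(m)] for b in range(m)]

def expected_win_rate(W, q):
    """Win rate of each model b against an opponent drawn from distribution q."""
    return [sum(W[b][a] * q[a] for a in range(len(q))) for b in range(len(W))]

# Hypothetical per-prompt leaderboard for three models (not from the paper).
theta = [0.2, 1.0, -0.5]
# Assumed opponent distribution: uniform over the same three models.
q = [1 / 3, 1 / 3, 1 / 3]

W = win_matrix(theta)
rates = expected_win_rate(W, q)
best = max(range(len(theta)), key=lambda b: rates[b])
print(best, [round(r, 3) for r in rates])
```

Because $\sigma$ is increasing, the model with the largest coefficient has the highest win rate against any opponent distribution, so without a cost constraint this selection reduces to an argmax over $\theta^*(z)$.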

Figures (11)

  • Figure 1: Pipeline of P2L. P2L takes a prompt or a set of prompts and outputs an $M$-dimensional vector that we call a leaderboard. Once we have a leaderboard, we can build better data products, like routers and automatic analyses (see right).
  • Figure 2: Loss metrics. The line plot shows the validation loss as a function of the number of data points seen during training. The P2L models all substantially outperform the baselines, and performance scales with dataset and model size. The bar plots show the validation loss and mean squared error of the models trained on all 1.5M training points.
  • Figure 3: P2L router performance on Chatbot Arena. The left barplot shows the overall score of the router after it was deployed prospectively on Chatbot Arena. The right barplot shows the worst-case category score on Chatbot Arena. Overall, larger models lead to higher Arena scores, i.e., better routers. The exception is P2L-1.5B, which has a large bump in overall performance. However, the confidence intervals indicate that this bump is explainable by statistical variations in its BT coefficient estimate.
  • Figure 4: Router model choice distribution in each prompt category. The rows are different models, and the columns are different categories. Each cell represents the probability that the model was selected within that category (i.e., columns sum to 1). Models with an average selection rate below 1% are not shown.
  • Figure 5: Arena score versus cost. Both plots show routing performance as a function of average cost. The left plot shows the averaged performance across all categories, and the right plot shows the performance in the creative writing category. The black open circles give the raw performance and cost of the models used by the router. Each gold dot represents the Arena score of the P2L-7B router as a function of the cost constraint in problem:optimal-routing-master. The plots show that the P2L router dominates and substantially improves the cost-performance Pareto frontier. All confidence intervals are 95%.
  • ...and 6 more figures
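
The cost-constrained routing described in the Figure 5 caption can be sketched as a small linear program over a routing distribution. The version below is a simplification under stated assumptions: it takes a hypothetical per-model score (for example, an expected win rate), a per-model cost, and a budget on expected cost, and it is not the paper's exact formulation. With a single budget constraint, an optimal distribution puts weight on at most two models, so enumerating single models and budget-matching pairs is enough.

```python
def route_under_budget(scores, costs, budget):
    """Maximize expected score subject to expected cost <= budget.

    Returns ({model_index: probability}, expected_score), or (None, -inf)
    if no model fits the budget.
    """
    m = len(scores)
    best_value, best_mix = float("-inf"), None
    # Candidate 1: a single model that fits the budget on its own.
    for i in range(m):
        if costs[i] <= budget and scores[i] > best_value:
            best_value, best_mix = scores[i], {i: 1.0}
    # Candidate 2: mix a cheaper and a pricier model so that the
    # expected cost lands exactly on the budget.
    for i in range(m):
        for j in range(m):
            if costs[i] < budget < costs[j]:
                p_j = (budget - costs[i]) / (costs[j] - costs[i])
                value = (1 - p_j) * scores[i] + p_j * scores[j]
                if value > best_value:
                    best_value, best_mix = value, {i: 1 - p_j, j: p_j}
    return best_mix, best_value

# Hypothetical scores and costs for three models (not from the paper).
scores = [0.40, 0.55, 0.70]
costs = [1.0, 4.0, 10.0]
mix, value = route_under_budget(scores, costs, budget=5.5)
print(mix, value)
```

Sweeping `budget` over a range of values and recording the resulting expected score traces out a cost-performance curve of the kind the caption describes.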

Theorems & Definitions (2)

  • Theorem 1: Optimal prompt-dependent routing
  • Proof: Proof of Theorem 1