Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation

Jasper Dekoninck, Maximilian Baader, Martin Vechev

TL;DR

Polyrating tackles the core challenges of evaluating large language models by explicitly modeling judge biases and enabling cross-task comparisons with a MAP-based, multivariate rating framework. It extends the Bradley–Terry model with shared bias features and task-specific modifiers, allowing ratings to reflect both model quality and evaluation context while leveraging existing benchmarks and LLM-based evaluations to improve sample efficiency. The framework provides convergence guarantees, uncertainty quantification via bootstrapping, and a multidimensional leaderboard, enabling nuanced comparisons of LLM strengths across tasks. By removing shift-invariance and quantifying biases, Polyrating offers more reliable model rankings and cost-effective evaluation suitable for real-world benchmarking across diverse tasks.
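The idea of extending Bradley–Terry with a shared bias feature and fitting it by MAP estimation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `fit_map`, the Gaussian priors, the natural-logit rating scale, the plain gradient-descent optimizer, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map(i_idx, j_idx, feats, y, n_models,
            sigma_rating=1.0, sigma_bias=1.0, lr=0.5, steps=3000):
    """MAP fit of a Bradley-Terry model with shared bias features (hypothetical helper).

    Game g pits model i_idx[g] against j_idx[g]; y[g] = 1 if the first model won.
    feats[g] holds bias features of the game (e.g. a scaled length difference).
    Zero-mean Gaussian priors on ratings and bias weights are an assumption.
    """
    n_games, n_feats = feats.shape
    r = np.zeros(n_models)        # one base rating per model
    beta = np.zeros(n_feats)      # one shared weight per bias feature
    for _ in range(steps):
        logits = r[i_idx] - r[j_idx] + feats @ beta
        err = sigmoid(logits) - y             # gradient of the log-loss w.r.t. the logit
        grad_r = np.zeros(n_models)
        np.add.at(grad_r, i_idx, err)         # first model enters with +1
        np.add.at(grad_r, j_idx, -err)        # second model enters with -1
        grad_r = (grad_r + r / sigma_rating**2) / n_games
        grad_beta = (feats.T @ err + beta / sigma_bias**2) / n_games
        r -= lr * grad_r
        beta -= lr * grad_beta
    return r, beta

# Synthetic check: three models and one bias feature with known ground truth.
rng = np.random.default_rng(0)
true_r = np.array([1.0, 0.0, -1.0])
true_beta = np.array([0.8])
n_games = 4000
i_idx = rng.integers(0, 3, size=n_games)
j_idx = (i_idx + rng.integers(1, 3, size=n_games)) % 3   # always a different opponent
feats = rng.uniform(-1.0, 1.0, size=(n_games, 1))
p_win = sigmoid(true_r[i_idx] - true_r[j_idx] + feats @ true_beta)
y = (rng.uniform(size=n_games) < p_win).astype(float)

r_hat, beta_hat = fit_map(i_idx, j_idx, feats, y, n_models=3)
print("ratings:", r_hat, "bias weight:", beta_hat)
```

Because the prior penalizes the ratings themselves, the fitted ratings are no longer free to shift by an arbitrary constant, which is one way to read the TL;DR's remark about removing shift-invariance.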

Abstract

Rating-based human evaluation has become an essential tool to accurately evaluate the impressive performance of large language models (LLMs). However, current rating systems suffer from several important limitations: first, they fail to account for biases that significantly influence evaluation results; second, they require large and expensive preference datasets to obtain accurate ratings; and third, they do not facilitate meaningful comparisons of model ratings across different tasks. To address these issues, we introduce Polyrating, an expressive and flexible rating system based on maximum a posteriori estimation that enables a more nuanced and thorough analysis of model performance at lower cost. Polyrating can detect and quantify biases affecting human preferences, ensuring fairer model comparisons. Further, Polyrating can reduce the cost of human evaluations by up to $41\%$ for new models and up to $77\%$ for new tasks by leveraging existing benchmark scores. Lastly, Polyrating enables direct comparisons of ratings across different tasks, providing a comprehensive understanding of an LLM's strengths, weaknesses, and relative performance across different applications.

Paper Structure (56 sections, 4 theorems, 29 equations, 5 figures, 6 tables)

Key Result

Theorem 1

The optimization objective in Eq. (eq:multi-logistic-loss) is convex and twice differentiable.
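The referenced objective is not reproduced on this page, so the following is only a sketch of why a result of this kind holds for a generic regularized logistic loss. The symbols $\theta$ (all rating parameters stacked into one vector), $x_g$ (the signed feature vector of game $g$), $y_g \in \{0, 1\}$ (the outcome), and $\lambda \ge 0$ (the prior strength) are assumptions introduced for illustration.

```latex
\mathcal{L}(\theta)
  = -\sum_{g} \Big[ y_g \log \sigma(x_g^\top \theta)
      + (1 - y_g) \log\big(1 - \sigma(x_g^\top \theta)\big) \Big]
    + \frac{\lambda}{2} \lVert \theta \rVert_2^2,
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

\nabla^2 \mathcal{L}(\theta)
  = \sum_{g} \sigma(x_g^\top \theta)\big(1 - \sigma(x_g^\top \theta)\big)\, x_g x_g^\top
    + \lambda I \;\succeq\; 0
```

Each summand of the Hessian is a nonnegative multiple of the positive semidefinite matrix $x_g x_g^\top$, so the loss is convex, and since $\sigma$ is smooth the loss is twice differentiable. With $\lambda > 0$ the Hessian is positive definite and the minimizer is unique.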

Figures (5)

  • Figure 1: Overview of Polyrating. Given preference datasets of $n$ samples for $k$ models over various tasks, the standard approach needs to fit separate and independent ratings for each task and cannot leverage continuous features. In contrast, Polyrating fits a single linear model for all tasks and can leverage continuous features. Attribution is given in the appendix (app:attribution).
  • Figure 2: Comparison between Polyrating and univariate baseline for different tasks. The $x$-axis shows the number of samples of the task the rating systems are using. The logistic loss shown is normalized by subtracting the loss of the best possible rating for that task. The grey horizontal line indicates the loss of a rating system that assigns the same rating to all models.
  • Figure 3: Comparison between Polyrating and the univariate baseline when leveraging information from existing benchmarks. For the left and middle plots, the $x$-axis shows the number of human annotations used. For the right plot, the $x$-axis shows the number of samples from the Chinese code task. The logistic loss is normalized by subtracting the loss of the best possible rating.
  • Figure 4: Logistic loss for all four alternatives on the Chatbot Arena dataset for various sizes of the training set.
  • Figure 5: Convergence rate of the univariate method and Polyrating when evaluating model versions. The $x$-axis represents the number of games available in the training set associated with the subsequent versions, while the $y$-axis represents the loss of the univariate method and Polyrating.
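The normalized logistic loss used in the Figure 2 and Figure 3 captions can be sketched as below. This is a hedged illustration: the helper names `logistic_loss` and `normalized_loss` are hypothetical, the loss is assumed to be the mean natural-log loss over pairwise games, and how the "best possible rating" is obtained is not stated on this page, so it is simply passed in by the caller.

```python
import numpy as np

def logistic_loss(ratings, i_idx, j_idx, y):
    """Mean logistic loss of pairwise outcomes under Bradley-Terry ratings.

    y[g] = 1 if model i_idx[g] beat model j_idx[g], else 0.
    """
    logits = ratings[i_idx] - ratings[j_idx]
    # logaddexp(0, -z) = log(1 + exp(-z)) is a numerically stable -log(sigmoid(z)).
    per_game = y * np.logaddexp(0.0, -logits) + (1 - y) * np.logaddexp(0.0, logits)
    return float(np.mean(per_game))

def normalized_loss(ratings, best_ratings, i_idx, j_idx, y):
    """Loss of `ratings` minus the loss of a reference rating (assumed given)."""
    return (logistic_loss(ratings, i_idx, j_idx, y)
            - logistic_loss(best_ratings, i_idx, j_idx, y))

# Tiny example with three models and three games.
i_idx = np.array([0, 0, 1])
j_idx = np.array([1, 2, 2])
y = np.array([1.0, 1.0, 0.0])

equal_ratings = np.zeros(3)
# Equal ratings predict 50% for every game, so the loss is log(2) regardless of outcomes.
baseline = logistic_loss(equal_ratings, i_idx, j_idx, y)
```

Under these assumptions, the baseline of a rating system that assigns every model the same rating sits at $\log 2$ before normalization, since each game is predicted as a coin flip.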

Theorems & Definitions (8)

  • Theorem 1: Convexity of the Optimization Objective
  • proof
  • Theorem 2: Equivalence of Ratings
  • Lemma 1: Shift-Invariance of Optimal Ratings
  • proof
  • Lemma 2: Limit Exists and Is Finite
  • proof
  • proof
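The shift-invariance named in Lemma 1 can be illustrated on the plain Bradley–Terry model. The lemma's exact statement is not reproduced on this page, so the following is a sketch under assumed notation: $r_i$ is the rating of model $i$, $c$ is an arbitrary constant, and $\sigma$ is the logistic function.

```latex
P(i \succ j)
  = \sigma(r_i - r_j)
  = \sigma\big((r_i + c) - (r_j + c)\big)
  \quad \text{for all } c \in \mathbb{R}
```

Because only rating differences enter the win probability, adding the same constant to every rating leaves the likelihood unchanged, so a maximum-likelihood solution without a prior or anchoring constraint is determined only up to a common shift.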