Ranking Large Language Models without Ground Truth

Amit Dhurandhar, Rahul Nair, Moninder Singh, Elizabeth Daly, Karthikeyan Natesan Ramamurthy

TL;DR

Ranking Large Language Models without Ground Truth proposes triplet-based, ground-truth-free benchmarking for LLMs. It introduces Greedy Triplet Ranking (GTR) and Full Triplet Ranking (FTR), which deduce the weakest model in triplets or via iterative reputation scoring, respectively, requiring only a dataset of prompts and a comparison function, not reference responses. The authors derive sufficient conditions under which triplet judgments identify weaker models, analyze time complexity (GTR: O(n^2), FTR: O(n^3)), and empirically show that GTR and FTR recover rankings across summarization, MCQA, and dialog tasks, often matching or approaching ground-truth baselines while being robust to evaluation noise. The results suggest a scalable, low-resource approach for model ranking without reliance on ground-truth responses, with potential applications in diverse domains and tasks.

Abstract

Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models in which each model evaluates the other two, so that the worst model in the triplet is correctly identified with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods to rank LLMs. In experiments on different generative tasks (summarization, multiple-choice, and dialog), our methods reliably recover close to true rankings without reference data. This points to a viable low-resource mechanism for practical use.

Paper Structure

This paper contains 27 sections, 3 theorems, 15 figures, 6 tables, 2 algorithms.

Key Result

Lemma 1

Consider a triplet of models $(M_i,M_j,M_k)$ whose accuracies $(a_i,a_j,a_k)$ satisfy $1\ge a_i>a_j>a_k\ge 0$, and suppose that no two models agree on an incorrect response. If $a_k < a_i+a_j-1$, then both $M_i$ and $M_j$, acting as judges, (correctly) vote $M_k$ the worse model.
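
One way to see why this condition suffices is sketched below. It is a reconstruction, not necessarily the paper's own proof, and it rests on assumptions that are not stated above: each prompt has a single correct response, a judge scores another model by the fraction of prompts on which their responses match, and $s_{xy}$ (a symbol introduced here) denotes that fraction for models $M_x$ and $M_y$. Because no two models agree on an incorrect response, two models match exactly when both are correct, so

$$
\begin{aligned}
s_{ij} &= \Pr[M_i \text{ and } M_j \text{ both correct}] \;\ge\; a_i + a_j - 1, \\
s_{ik} &= \Pr[M_i \text{ and } M_k \text{ both correct}] \;\le\; \min(a_i, a_k) = a_k, \\
s_{jk} &= \Pr[M_j \text{ and } M_k \text{ both correct}] \;\le\; \min(a_j, a_k) = a_k.
\end{aligned}
$$

If $a_k < a_i + a_j - 1$, then $s_{ik} \le a_k < s_{ij}$ and $s_{jk} \le a_k < s_{ij}$. Judge $M_i$ therefore agrees more with $M_j$ than with $M_k$, and judge $M_j$ agrees more with $M_i$ than with $M_k$, so both vote $M_k$ the worse model. The condition is sufficient rather than necessary: with the accuracies quoted for Figure 1 ($0.8$, $0.6$, $0.4$), $a_i + a_j - 1 = 0.4 = a_k$, so the strict inequality fails, yet the caption still reports $M_3$ being voted worst.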

Figures (15)

  • Figure 1: Intuition behind the triplet approach. The three models $M_1$, $M_2$, and $M_3$ have accuracies of $80\%$, $60\%$, and $40\%$ respectively, based on their responses to five prompts (green marks correct responses, red incorrect ones) compared with the ground truth, which is unknown to us. The triplet approach ranks $M_3$ as the worst model, since it is ranked as such by both $M_1$ (only two answers match) and $M_2$ (only one answer matches). This core idea (with slight variations) can be applied repeatedly to rank an arbitrary number of models, as described by the algorithms in the paper's methods section.
  • Figure 2: Evaluation metrics on summarization for two datasets, CNN/DM (top) and XSUM (bottom), showing RBO (left) and MAP-5 (right) as a function of the number of models being ranked (note that the x-axis is not linear).
  • Figure 3: Number of triplet evaluations for the CNN/DM dataset (log y-scale).
  • Figure 4: Quality of rankings recovered as a function of noise in the evaluation function for different methods. Four sets of $10$ LLMs with each set having a maximum LLM accuracy of $30\%$, $50\%$, $70\%$ or $90\%$ are considered, where the number of questions is $50$. We see the robustness of the proposed methods (GTR and FTR) at low to medium levels of noise in such a setup.
  • Figure 5: Evaluation metrics on multiple-choice, RBO (left) and MAP-5 (right), when ranking $25$ models where the accuracy of the best performing model is $50\%$.
  • ...and 10 more figures
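
The triplet vote described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the response lists are hypothetical (chosen only to reproduce the accuracies and match counts quoted in the caption, with "A" as the hidden correct answer for every prompt), the comparison function is assumed to be exact-match agreement, and the names `agreement` and `triplet_votes` are made up for this example.

```python
def agreement(a, b):
    """Count the prompts on which two response lists match exactly."""
    return sum(x == y for x, y in zip(a, b))


def triplet_votes(responses):
    """Let each model judge the other two by agreement with its own answers.

    The judge casts one vote against whichever of the other two models
    agrees with it less; a tie casts no vote. Returns vote counts per model.
    """
    names = list(responses)
    votes = {name: 0 for name in names}
    for judge in names:
        x, y = [n for n in names if n != judge]
        ax = agreement(responses[judge], responses[x])
        ay = agreement(responses[judge], responses[y])
        if ax < ay:
            votes[x] += 1
        elif ay < ax:
            votes[y] += 1
    return votes


# Hypothetical responses to five prompts; the (unseen) correct answer is "A".
# Accuracies are 80%, 60%, 40%, and no two models share a wrong answer.
responses = {
    "M1": ["A", "A", "A", "A", "B"],
    "M2": ["B", "A", "A", "A", "C"],
    "M3": ["A", "A", "B", "C", "D"],
}

votes = triplet_votes(responses)
worst = max(votes, key=votes.get)
print(votes, worst)
```

With these responses, $M_1$ matches $M_2$ on three prompts and $M_3$ on two, while $M_2$ matches $M_3$ on one, so both $M_1$ and $M_2$ vote against $M_3$ without any reference answers being consulted.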

Theorems & Definitions (6)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Proposition 1
  • proof