Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

Kehua Feng, Keyan Ding, Hongzhi Tan, Kede Ma, Zhihua Wang, Shuangquan Guo, Yuzhou Cheng, Ge Sun, Guozhou Zheng, Qiang Zhang, Huajun Chen

TL;DR

This work introduces a sample-efficient framework for evaluating large language models by automatically selecting a small set of input instructions that maximize semantic discrepancy between model outputs (MAD Competition). Human evaluators then provide three-alternative judgments on the selected pairs, and the results are aggregated with an Elo rating to produce a global ranking. Across eight models and four tasks, MAD-Eval reproduces gold-standard rankings with dramatically reduced annotation effort and yields insights into model strengths and weaknesses. The method scales to more models, aligns with large-scale human leaderboards, and offers a valuable adversarial dataset for future model development and refinement.
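The selection step described above can be illustrated with a minimal sketch. It assumes that responses from two models are already available for every instruction in the pool, and it uses a bag-of-words cosine distance as a stand-in for the semantic discrepancy measure. Both `cosine_distance` and `select_top_k` are hypothetical names introduced here for illustration, not part of the released MAD-Eval code, and the actual discrepancy measure may differ.

```python
import math
from collections import Counter


def cosine_distance(a: str, b: str) -> float:
    """Stand-in discrepancy (an assumption): 1 - cosine similarity of word counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty response is treated as maximally different.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def select_top_k(instructions, responses_a, responses_b, k,
                 discrepancy=cosine_distance):
    """Return the k instructions whose two responses differ the most."""
    scored = [
        (discrepancy(ra, rb), i)
        for i, (ra, rb) in enumerate(zip(responses_a, responses_b))
    ]
    # Largest discrepancy first; ties broken by original position.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [
        (instructions[i], responses_a[i], responses_b[i], score)
        for score, i in scored[:k]
    ]


if __name__ == "__main__":
    pool = ["q1", "q2", "q3"]
    model_a = ["the cat sat", "two plus two is four", "hello world"]
    model_b = ["the cat sat", "the answer is five", "hello there world"]
    for instruction, _, _, score in select_top_k(pool, model_a, model_b, k=2):
        print(instruction, round(score, 3))
```

Passing the discrepancy function as a parameter keeps the sketch independent of any particular similarity model; an embedding-based measure could be substituted without changing the selection logic.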

Abstract

Reliable evaluation of large language models (LLMs) is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers "gold-standard" model rankings with a handful of MAD-selected instructions, reveals respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval .
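As a rough illustration of the aggregation step, the sketch below applies the standard Elo update to three-alternative outcomes (first model preferred, second model preferred, or tie). The K-factor of 16, the initial rating of 1000, and the function names are assumptions made for this example; the paper's exact settings are not stated in this section.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ranking(comparisons, k_factor: float = 16.0, initial: float = 1000.0):
    """Aggregate (model_a, model_b, outcome) triples into a ranking.

    outcome is "a" (model_a preferred), "b" (model_b preferred), or "tie".
    k_factor and initial are illustrative assumptions.
    """
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, outcome in comparisons:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        s_a = score_for_a[outcome]
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: what one model gains, the other loses.
        ratings[model_a] = r_a + k_factor * (s_a - e_a)
        ratings[model_b] = r_b - k_factor * (s_a - e_a)
    return sorted(ratings.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    judgments = [
        ("m1", "m2", "a"),
        ("m2", "m3", "a"),
        ("m1", "m3", "a"),
        ("m1", "m2", "tie"),
    ]
    for model, rating in elo_ranking(judgments):
        print(model, round(rating, 1))
```

Sequential Elo updates depend on the order in which comparisons are processed, so a practical implementation might average the ratings over several random orderings of the judgments.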

Paper Structure

This paper contains 44 sections, 7 equations, 6 figures, 31 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Overview of the proposed sample-efficient human evaluation method for comparing LLMs adaptively. Starting from a small set of task-specific seed instructions, we apply an instruction evolution procedure to generate a large-scale pool of diverse instructions. For any two competing LLMs, we then conduct MAD Competition to automatically and adaptively select the top-$K$ instructions (and their corresponding responses) that most effectively distinguish model behaviors. These selected response pairs are presented to human evaluators (along with the input instruction), who express pairwise preferences. Finally, we feed these comparison outcomes into an Elo rating system to produce a global ranking of all evaluated LLMs.
  • Figure 2: Task distribution in our experiment.
  • Figure 3: Sankey diagram of eight LLMs’ ranking shifts across our sample-efficient human evaluation method, Chatbot Arena, AlpacaEval-2.0, and CompassRank (Nov. 2024 snapshot).
  • Figure 4: Spearman's $\rho$ between the global model ranking produced using the default top-$10$ instructions and the rankings obtained with fewer instructions ($K\in\{1,\ldots, 9\}$), plotted for each of the four tasks. Correlations exceed $0.95$ for $K \ge 5$ and reach $1.0$ for $K \ge 8$, illustrating the robustness of our sample-efficient evaluation method even under a constrained annotation budget.
  • Figure 5: Graphical user interface for collecting human preferences.
  • ...and 1 more figures
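
The rank agreement reported in Figure 4 can be computed as in the sketch below, which uses the closed-form Spearman's $\rho$ for rankings without ties. The model names and the example rankings are hypothetical placeholders, not results from the paper.

```python
def spearman_rho(ranking_x, ranking_y):
    """Spearman's rho between two tie-free rankings of the same items.

    Each ranking is a list of item names ordered from best to worst.
    Uses rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
    """
    if set(ranking_x) != set(ranking_y) or len(ranking_x) != len(set(ranking_x)):
        raise ValueError("rankings must contain the same distinct items")
    n = len(ranking_x)
    if n < 2:
        raise ValueError("need at least two items")
    position_y = {item: pos for pos, item in enumerate(ranking_y)}
    sum_d2 = sum((pos - position_y[item]) ** 2
                 for pos, item in enumerate(ranking_x))
    return 1.0 - 6.0 * sum_d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical rankings of eight models: a full-budget reference and a
    # reduced-budget ranking in which two adjacent models swap places.
    reference = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8"]
    reduced = ["m1", "m2", "m4", "m3", "m5", "m6", "m7", "m8"]
    print(round(spearman_rho(reference, reduced), 4))
```

With eight models, a single swap of adjacent ranks already lowers $\rho$ from $1.0$ to about $0.976$, which gives a sense of how small the ranking changes must be for correlations above $0.95$.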