Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
Kehua Feng, Keyan Ding, Hongzhi Tan, Kede Ma, Zhihua Wang, Shuangquan Guo, Yuzhou Cheng, Ge Sun, Guozhou Zheng, Qiang Zhang, Huajun Chen
TL;DR
This work introduces a sample-efficient framework for evaluating large language models based on MAximum Discrepancy (MAD) Competition: it automatically selects a small set of input instructions that maximize the semantic discrepancy between pairs of model responses. Human evaluators then make three-alternative forced-choice judgments on the selected response pairs, and the results are aggregated with Elo rating into a global ranking. Across eight models and four tasks, MAD-Eval reproduces gold-standard rankings with far less annotation effort and reveals each model's strengths and weaknesses. The method scales to more models, aligns with large-scale human leaderboards, and yields an adversarial dataset for future model development and refinement.
Abstract
Reliable evaluation of large language models (LLMs) is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers "gold-standard" model rankings with a handful of MAD-selected instructions, reveals the respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval.
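The pipeline described above (select high-discrepancy instructions, collect three-way human judgments, aggregate with Elo) can be illustrated with a minimal sketch. It is not the authors' implementation: the token-level Jaccard discrepancy is only a stand-in for whatever semantic similarity measure the method uses, the Elo constants (initial rating 1000, K-factor 32, scale 400) are conventional defaults assumed here, and all function names are hypothetical.

```python
def jaccard_discrepancy(a: str, b: str) -> float:
    """Stand-in discrepancy: 1 minus the Jaccard overlap of lowercase tokens.
    (Assumption: a real system would use a semantic similarity measure.)"""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_mad_instructions(responses_a, responses_b, k,
                            discrepancy=jaccard_discrepancy):
    """Return the indices of the k instructions whose paired responses
    from two models differ the most."""
    scores = [discrepancy(a, b) for a, b in zip(responses_a, responses_b)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def elo_update(rating_a, rating_b, outcome, k_factor=32.0):
    """Update two ratings after one judgment.
    outcome: 1.0 if A is preferred, 0.5 for a tie, 0.0 if B is preferred."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k_factor * (outcome - expected_a)
    return rating_a + delta, rating_b - delta


def rank_models(judgments, initial=1000.0):
    """Aggregate (model_a, model_b, outcome) judgments into a global ranking,
    best model first."""
    ratings = {}
    for model_a, model_b, outcome in judgments:
        r_a = ratings.get(model_a, initial)
        r_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = elo_update(r_a, r_b, outcome)
    return sorted(ratings, key=ratings.get, reverse=True)
```

In this sketch the three-alternative judgment maps directly onto the three Elo outcomes (A preferred, tie, B preferred), which is why a single scalar `outcome` suffices. Sequential Elo updates depend on the order of judgments, so a practical system might shuffle and average over several passes.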
