State of What Art? A Call for Multi-Prompt LLM Evaluation
Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky
TL;DR
The paper addresses a weakness in how large language models (LLMs) are commonly evaluated: benchmarks typically use a single instruction template per task, and the authors show that this yields brittle, unreliable results across models and tasks. Their analysis covers 6.5M instances, involving 20 LLMs and 39 tasks from 3 benchmarks. They propose a multi-prompt evaluation framework and publish a dataset of over 175 automatically generated paraphrases (roughly 5K instructions) to support it. The work defines three metrics: MaxP for peak capability, AvgP for robustness to paraphrase, and a Combined Performance Score (CPS) that balances the two. It demonstrates that model rankings and absolute performance can differ substantially when models are evaluated on paraphrase sets rather than on a single prompt. The findings argue for adopting multi-prompt evaluation tailored to evaluators’ goals, which should improve the consistency, comparability, and real-world relevance of LLM assessments.
Abstract
Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.
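To make the idea of use-case-specific metrics concrete, the following is a minimal sketch of how per-prompt scores for one model on one task might be aggregated. It treats peak performance as the maximum over instruction paraphrases and robustness as the mean over them. The combined score shown here (peak performance discounted by the gap between peak and mean) is an assumption made for illustration, not a formula taken from this summary; the function names and the accuracy values are hypothetical.

```python
from statistics import mean


def max_p(scores):
    """Peak performance: the best score over all instruction paraphrases."""
    return max(scores)


def avg_p(scores):
    """Robustness: the mean score over all instruction paraphrases."""
    return mean(scores)


def combined_score(scores):
    """Illustrative combined score (ASSUMED form, not confirmed by the text).

    saturation = 1 - (peak - mean), so a model whose mean is close to its
    peak keeps most of its peak score, while a prompt-sensitive model is
    penalized. Scores are expected to lie in [0, 1].
    """
    best, avg = max_p(scores), avg_p(scores)
    saturation = 1 - (best - avg)
    return saturation * best


# Hypothetical per-paraphrase accuracies for two models on the same task.
model_a = [0.80, 0.40, 0.30, 0.50]  # high peak, very sensitive to wording
model_b = [0.70, 0.65, 0.60, 0.65]  # lower peak, stable across paraphrases

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(
        f"{name}: peak={max_p(scores):.3f} "
        f"mean={avg_p(scores):.3f} "
        f"combined={combined_score(scores):.3f}"
    )
```

With these made-up numbers, model_a ranks first on peak performance while model_b ranks first on mean performance and on the combined score, which illustrates how the choice of metric can change which model appears stronger.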
