State of What Art? A Call for Multi-Prompt LLM Evaluation

Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky

TL;DR

The paper tackles the problem of evaluating large language models (LLMs) using a single instruction template, which the authors show yields brittle and unreliable results across models and tasks. They introduce a multi-prompt evaluation framework and publish a dataset of over 175 automatically generated paraphrases (roughly 5K instructions) to enable robust benchmarking across 6.5M instances for 20 LLMs over 39 tasks. The work defines MaxP, AvgP, and a Combined Performance Score (CPS) to capture peak capability, robustness to paraphrase, and a balance between the two, and demonstrates that model rankings and absolute performances can differ substantially when evaluated with paraphrase sets. The findings argue for adopting multi-prompt evaluation tailored to evaluators’ goals, which should improve the consistency, comparability, and real-world relevance of LLM assessments.
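To make the difference between peak and average performance concrete, here is a minimal Python sketch. It assumes, based only on the metric names in the summary above, that MaxP is the best score a model reaches over a set of paraphrases and AvgP is the mean score over the same set. The function names, model names, and accuracy values are hypothetical and are not taken from the paper. CPS is left out because its formula is not given in this summary.

```python
from statistics import mean


def max_performance(scores):
    """Assumed reading of MaxP: the best score over all paraphrases."""
    return max(scores)


def average_performance(scores):
    """Assumed reading of AvgP: the mean score over all paraphrases."""
    return mean(scores)


def rank_models(scores_by_model, metric):
    """Order model names from best to worst under the given metric."""
    return sorted(
        scores_by_model,
        key=lambda model: metric(scores_by_model[model]),
        reverse=True,
    )


# Hypothetical accuracies of two models on three paraphrases of one task.
scores_by_model = {
    "model_a": [0.90, 0.40, 0.50],  # high peak, sensitive to wording
    "model_b": [0.70, 0.68, 0.72],  # lower peak, stable across wordings
}

if __name__ == "__main__":
    print("Ranking by peak:", rank_models(scores_by_model, max_performance))
    print("Ranking by mean:", rank_models(scores_by_model, average_performance))
```

With these illustrative numbers the two metrics order the models differently, which is the kind of disagreement the summary attributes to the choice of evaluation goal.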

Abstract

Recent advances in large language models (LLMs) have led to the development of various evaluation benchmarks. These benchmarks typically rely on a single instruction template for evaluating all LLMs on a specific task. In this paper, we comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead. We discuss tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), ensuring a more reliable and meaningful assessment of LLM capabilities. We then implement these criteria and conduct evaluations of multiple models, providing insights into the true strengths and limitations of current LLMs.

Paper Structure (37 sections, 6 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Evaluation of different OpenAI models on the homophones task from LMentry over four paraphrases. Each cluster of columns corresponds to a distinct paraphrased instruction template (see respective texts below; words in bold indicate an instantiation). Despite all instructions being semantically equivalent, both absolute performance and relative ranking vary widely.
  • Figure 2: Model performance and ranking induced by pairs of paraphrases that exhibit the minimal Kendall $\tau$ correlation on three different tasks (one for each benchmark). For each template pair, models are ordered according to their performance against the first instruction template $P_1$, enabling straightforward comparisons of ranking changes. In other words, if the bars of $P_2$ appear scattered rather than follow a clear descending order, this indicates a significant reshuffling of rankings.
  • Figure 3: Model and task performance divergence. For each LMentry task, we show the number of standard deviations by which performance of each model on the original instructions deviates from averaged performance. Dark cells indicate substantial divergence values (>1 std).
  • Figure 4: Average performance differences between various models for the most common minimal edits between two instruction templates (e.g., substituting 'excludes' with 'lacks') in the LMentry benchmark.
  • Figure 5: Percentage of instruction paraphrases with accuracy higher than 5% in T5 models (blue) vs. LLaMA models (purple) on LMentry tasks.
  • ...and 2 more figures
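
The caption of Figure 2 compares the model rankings induced by two paraphrases using Kendall $\tau$. As a rough illustration, the sketch below computes the simple tau-a variant (concordant minus discordant model pairs, divided by the number of pairs) in plain Python. Which variant the paper uses is not stated in this summary, so tau-a is an assumption, and the function name, model names, and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(scores_p1, scores_p2):
    """Kendall tau-a between the model rankings induced by two prompts.

    Each argument maps model name -> score under one paraphrase.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(scores_p1)
    if sorted(scores_p2) != models:
        raise ValueError("both prompts must be scored on the same models")
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")

    concordant = discordant = 0
    for a, b in combinations(models, 2):
        diff_p1 = scores_p1[a] - scores_p1[b]
        diff_p2 = scores_p2[a] - scores_p2[b]
        if diff_p1 * diff_p2 > 0:
            concordant += 1  # both prompts order this pair the same way
        elif diff_p1 * diff_p2 < 0:
            discordant += 1  # the prompts disagree on this pair

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies of three models under two paraphrases.
p1 = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}
p2 = {"model_a": 0.5, "model_b": 0.7, "model_c": 0.9}

if __name__ == "__main__":
    print(kendall_tau_a(p1, p1))  # identical rankings
    print(kendall_tau_a(p1, p2))  # fully reversed rankings
```

A value near 1 means the two paraphrases rank the models almost identically, while a value near -1 means the ranking is close to reversed.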