DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky

Abstract

Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This calls into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, the observation that few-shot examples reduce sensitivity, and the identification of instances that are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/
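
To make the idea of joint perturbations concrete, the following is a minimal sketch of how varying several prompt dimensions at once multiplies the number of prompts per instance. It is an illustration only, not the DOVE construction code: the dimension names, their values, and the helper functions (`render_prompt`, `all_configurations`, `sample_configurations`) are all hypothetical and chosen for this example.

```python
import itertools
import random

# Hypothetical prompt dimensions. The values are illustrative and are
# NOT the actual dimensions or values used to build DOVE.
DIMENSIONS = {
    "instruction": [
        "Answer the following multiple-choice question.",
        "Choose the correct option for the question below.",
    ],
    "enumerator": ["letters", "numbers"],
    "separator": ["\n", " | "],
}

ENUMERATORS = {"letters": "ABCD", "numbers": "1234"}


def render_prompt(question, choices, instruction, enumerator, separator):
    """Render one multiple-choice prompt under a given configuration."""
    labels = ENUMERATORS[enumerator]
    options = separator.join(
        f"{label}. {choice}" for label, choice in zip(labels, choices)
    )
    return f"{instruction}\n{question}\n{options}\nAnswer:"


def all_configurations(dimensions):
    """Yield every joint setting of the dimensions (Cartesian product)."""
    names = list(dimensions)
    for values in itertools.product(*(dimensions[name] for name in names)):
        yield dict(zip(names, values))


def sample_configurations(dimensions, k, seed=0):
    """Draw up to k distinct joint configurations, reproducibly."""
    rng = random.Random(seed)
    configs = list(all_configurations(dimensions))
    return rng.sample(configs, min(k, len(configs)))


if __name__ == "__main__":
    question = "What is 2 + 2?"
    choices = ["3", "4", "5", "6"]
    for config in sample_configurations(DIMENSIONS, k=3):
        print(render_prompt(question, choices, **config))
        print("-" * 20)
```

With three binary dimensions this toy space already holds 2 × 2 × 2 = 8 prompts for a single instance. The count grows multiplicatively with each added dimension or value, which is why sampling from the space, rather than enumerating it, becomes attractive at scale.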

Paper Structure

This paper contains 43 sections, 3 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Building Dove. To holistically explore LLM sensitivity, we sample prompts as a walk in the space of various prompt dimensions (rows, above).
  • Figure 2: Dove requires a diverse set of skills.
  • Figure 3: Performance variations across evaluation datasets. Each datapoint represents the accuracy of one model calculated across 100 instances. Vertical scatter plots illustrate the variance within each dataset and each model. Model performance varies substantially, indicating persistent prompt sensitivity at large scales.
  • Figure 4: Accuracy marginalization for different dimensions. Variations along each of the dimensions in Dove lead to prompt sensitivity, even when controlling for all other dimensions.
  • Figure 5: Substantial performance differences across prompt perturbations. The number of standard deviations by which model performance on original instructions deviates from the average across few-shot prompts. Dark cells show substantial divergence.
  • ...and 10 more figures