Stronger Random Baselines for In-Context Learning

Gregory Yauney, David Mimno

TL;DR

This work accounts for the common practice of validation set reuse and existing small datasets with a stronger random baseline: the expected maximum accuracy across multiple random classifiers, which provides an easily calculated drop-in replacement for the standard baseline.

Abstract

Evaluating the in-context learning classification performance of language models poses challenges due to small dataset sizes, extensive prompt selection using the validation set, and intentionally difficult tasks that lead to near-random performance. The standard random baseline (the expected accuracy of guessing labels uniformly at random) is stable when the evaluation set is used only once or when the dataset is large. We account for the common practice of validation set reuse and existing small datasets with a stronger random baseline: the expected maximum accuracy across multiple random classifiers. When choosing the best prompt demonstrations across six quantized language models applied to 16 BIG-bench Lite tasks, more than 20% of the few-shot results that exceed the standard baseline do not exceed this stronger random baseline. When held-out test sets are available, this stronger baseline is also a better predictor of held-out performance than the standard baseline, avoiding unnecessary test set evaluations. This maximum random baseline provides an easily calculated drop-in replacement for the standard baseline.
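
To make the "expected maximum accuracy across multiple random classifiers" concrete, here is a minimal sketch of one way to compute it. It is not taken from the authors' code. It assumes that each of the `t` random classifiers guesses independently, so that its number of correct answers on `n` examples follows a Binomial(`n`, `p`) distribution, where `p` is the chance of a single correct guess (for example 1/(number of labels)). The function name and parameters are hypothetical. The calculation uses the standard order-statistics identity P(max = k) = F(k)^t - F(k-1)^t, where F is the binomial CDF.

```python
from math import exp, lgamma, log


def expected_max_accuracy(n: int, p: float, t: int) -> float:
    """Expected best accuracy among t independent random classifiers.

    Assumes each classifier's count of correct guesses on n examples is an
    independent Binomial(n, p) draw, with 0 < p < 1 (illustrative assumption).
    """
    if n <= 0 or t <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0, t > 0 and 0 < p < 1")

    expected = 0.0
    cdf_prev = 0.0  # F(k - 1), starts at F(-1) = 0
    cdf = 0.0       # F(k)
    for k in range(n + 1):
        # Binomial pmf computed in log space so large n does not overflow.
        log_pmf = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p)
        )
        cdf = min(1.0, cdf + exp(log_pmf))
        # P(max == k) = F(k)^t - F(k-1)^t for t i.i.d. draws.
        expected += (k / n) * (cdf**t - cdf_prev**t)
        cdf_prev = cdf
    return expected


if __name__ == "__main__":
    # With t = 1 this reduces to the standard random baseline p.
    print(expected_max_accuracy(n=100, p=0.5, t=1))
    # Reusing the same 100 examples for 200 prompts raises the bar.
    print(expected_max_accuracy(n=100, p=0.5, t=200))
```

With `t = 1` the result equals `p`, the standard baseline; as `t` grows, or as `n` shrinks, the value rises, which is why the two baselines diverge most on small, heavily reused validation sets.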

Paper Structure (35 sections, 11 equations, 11 figures, 4 tables)

This paper contains 35 sections, 11 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: 200 different prompts for the emoji_movie task yield a spread of accuracies for OLMo-7B (4-shot, quantized). The best prompt has much higher accuracy than the expected performance of a single random classifier. But its performance is worse than the expected maximum accuracy among 200 different random classifiers.
  • Figure 2: The expected maximum accuracy achieved among $t$ random classifiers on a binary classification dataset depends on $t$ and the size of the dataset.
  • Figure 3: OLMo-7B 1-, 2-, and 4-shot beats the standard random baseline (dashed line) on four tasks in expected maximum validation accuracy. But accounting for validation set reuse with the maximum random baseline (solid black line), the best accuracies across prompts on the left two datasets are in fact no better than random.
  • Figure 4: Expected maximum validation accuracy compared to the standard random and maximum random baseline for base and instruction-tuned models on a single hard dataset.
  • Figure 5: ROC and precision-recall curves when using maximum validation accuracy to predict whether held-out test accuracy will be above random chance. The standard and maximum curves use binary predictions of above or below the given random baseline. The gray curve uses the distribution functions for confidence scores.
  • ...and 6 more figures
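
The dependence on $t$ and dataset size described in the Figure 2 caption can be checked empirically with a small Monte Carlo simulation. The sketch below is illustrative only and is not the paper's procedure: it assumes each random classifier guesses every example independently with success probability `p` (0.5 for balanced binary classification), and the function name, `trials`, and `seed` are hypothetical choices.

```python
import random


def simulate_expected_max_accuracy(
    n: int, t: int, p: float = 0.5, trials: int = 300, seed: int = 0
) -> float:
    """Estimate the expected best accuracy among t random classifiers.

    Each trial draws t random classifiers on n examples, keeps the best
    accuracy, and the estimates are averaged over all trials.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _trial in range(trials):
        best_correct = 0
        for _classifier in range(t):
            # Count how many of the n examples this classifier guesses right.
            correct = sum(1 for _example in range(n) if rng.random() < p)
            best_correct = max(best_correct, correct)
        total += best_correct / n
    return total / trials


if __name__ == "__main__":
    for n in (20, 200):
        for t in (1, 10):
            estimate = simulate_expected_max_accuracy(n=n, t=t)
            print(f"n={n:4d} t={t:3d} estimated max baseline={estimate:.3f}")
```

Under these assumptions, the estimate stays near `p` when `t = 1`, increases with `t`, and increases more on the smaller dataset, matching the qualitative trend the caption describes.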