Anchor Points: Benchmarking Models with Much Fewer Examples

Rajan Vivek, Kawin Ethayarajh, Diyi Yang, Douwe Kiela

TL;DR

This work introduces micro-benchmarking via Anchor Point Selection, which evaluates models on large language benchmarks using far fewer examples. Because model predictions on many pairs of points are correlated across models, a small set of anchor points can rank many models (87 model-prompt pairs in the reported experiments) and even estimate per-instance predictions across the full dataset. Anchor Point Maps provide a visual, region-focused view of model weaknesses and generalization patterns, enabling fine-grained comparisons without exhaustive evaluation. While promising, the approach depends on predictive correlations transferring across model families, and its limitations in generalization and computational cost warrant further theoretical and methodological development. Overall, Anchor Points offer a practical route to cheaper, interpretable model benchmarking with broad applicability and clear avenues for future work.

Abstract

Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset. Anchor points reliably rank models: across 87 diverse language model-prompt pairs, evaluating models using 1-30 anchor points outperforms uniform sampling and other baselines at accurately ranking models. Moreover, just a few anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error, sufficient for gauging where the model is likely to fail. Lastly, we present Anchor Point Maps for visualizing these insights and facilitating comparisons of the performance of different models on various regions within the dataset distribution.
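
To make the idea in the abstract concrete, here is a minimal sketch of how a small, weighted subset could be picked from cross-model correlations. It is an illustration under stated assumptions, not the authors' implementation: the clustering method (a plain k-medoids loop on a 1 - correlation distance), the function names select_anchor_points and anchor_score, and the synthetic confidence matrix are all hypothetical choices made for this example.

```python
import numpy as np

def select_anchor_points(conf, k, seed=0, max_iter=100):
    """Pick k anchor points from source-model confidences (hypothetical sketch).

    conf: array of shape (n_models, n_points); entry [m, i] is source model m's
          confidence in the correct class on point i.
    Returns (anchor indices, weights), where each weight is the fraction of
    points assigned to that anchor.
    """
    rng = np.random.default_rng(seed)
    # Correlation between every pair of points, measured across source models.
    # Points with constant confidence give NaN correlations; treat those as 0.
    corr = np.nan_to_num(np.corrcoef(conf, rowvar=False))
    dist = 1.0 - corr  # assumption: strongly correlated points count as "close"
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        assign = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    assign = np.argmin(dist[:, medoids], axis=1)
    weights = np.bincount(assign, minlength=k) / n
    return medoids, weights

def anchor_score(target_conf_on_anchors, weights):
    """Weighted average of a target model's confidence on the anchor points."""
    return float(np.dot(weights, target_conf_on_anchors))

# Synthetic stand-in data: 20 source models, 50 benchmark points.
rng = np.random.default_rng(0)
source_conf = rng.random((20, 50))
anchors, weights = select_anchor_points(source_conf, k=5)

# A held-out target model is evaluated on the 5 anchor points only.
target_conf = rng.random(50)
score = anchor_score(target_conf[anchors], weights)
```

With real data, the score from a handful of anchors would be compared against the full-benchmark score across many target models, for example with a rank correlation, to check whether the anchors preserve the model ranking.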

Paper Structure (44 sections, 2 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: SST-2 Validation Set Anchor Point Map. The locations of all 872 points are learned using the predictions of $N=60$ randomly selected source models on SST-2. (Note that $N$ can be as small as 10; see Table \ref{tab:agreement_table}.) We then evaluate a held-out model, Falcon-7B, on 30 anchor points (green triangles). The model's predictions on only these 30 points are used to estimate the Falcon-7B predictions on the remaining 842 points with a mean absolute error of 0.09, achieving 92% agreement with the model's true predictions. The anchor points identify regions where the model is weak (red regions). We show the same Anchor Point Map colored by the true Falcon-7B predictions in Figure \ref{fig:sst2-corrmap-comparison}, demonstrating that the model is indeed weak in these areas.
  • Figure 2: Anchor Points are Micro-Benchmarks: tiny, representative subsets of large benchmarks. Correlative structure in the predictions of source models on the large benchmark can be used to extract these points. Each anchor point has a weight corresponding to the fraction of the benchmark it represents. Evaluating models on the anchor points produces a score that is rank-correlated with performance on the entire benchmark. Anchor Point Maps visualize a given model's likely instance-level performance on all points in the benchmark using only its performance on the anchor points.
  • Figure 3: Predictive Correlations at the Instance-Level Across Language Models
  • Figure 4: Analysing Patterns in Model Knowledge using MMLU Anchor Point Map.
  • Figure 5: Anchor Point Map for 1000 QQP examples. The map is computed using the predictions of 60 randomly selected source models. We then estimate the predictions of the two held-out target models, deberta-v3-base and text-davinci-003, by evaluating each on 30 anchor points. We color the remaining 970 test points in \ref{fig:deberta-est} and \ref{fig:davinci-est} with these estimates. Finally, we color maps \ref{fig:deberta-true} and \ref{fig:davinci-true} with the true target model predictions. We observe that the estimated predictions achieve low MAE and high agreement with the true predictions.
  • ...and 12 more figures
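
The captions for Figures 1 and 5 report a mean absolute error (MAE) and an agreement rate between estimated and true predictions on the non-anchor points. The sketch below shows one way such numbers could be computed. It rests on assumptions not stated in the captions: each point simply inherits the target model's confidence on its most-correlated anchor, agreement is taken to mean that the estimated and true confidences fall on the same side of a 0.5 threshold, and the names estimate_from_anchors and mae_and_agreement, the anchor indices, and the data are hypothetical.

```python
import numpy as np

def estimate_from_anchors(source_conf, anchors, target_on_anchors):
    """Estimate a target model's confidence on every point (hypothetical sketch).

    Assumption: each point copies the target model's confidence on the anchor
    it is most correlated with, where correlation is measured across source models.
    """
    corr = np.nan_to_num(np.corrcoef(source_conf, rowvar=False))
    nearest = np.argmax(corr[:, anchors], axis=1)  # best anchor per point
    return np.asarray(target_on_anchors, dtype=float)[nearest]

def mae_and_agreement(est, true, threshold=0.5):
    """MAE between confidences, and the fraction of points where both
    confidences fall on the same side of the threshold (assumed definition)."""
    est = np.asarray(est, dtype=float)
    true = np.asarray(true, dtype=float)
    mae = float(np.mean(np.abs(est - true)))
    agreement = float(np.mean((est > threshold) == (true > threshold)))
    return mae, agreement

# Synthetic stand-in data: 20 source models, 50 points, 5 arbitrary anchors.
rng = np.random.default_rng(0)
source_conf = rng.random((20, 50))
anchors = np.array([0, 10, 20, 30, 40])  # placeholder anchor indices
target_true = rng.random(50)             # held-out model, normally unknown

est = estimate_from_anchors(source_conf, anchors, target_true[anchors])

# Score only the non-anchor points, mirroring the "remaining points" in the captions.
rest = np.setdiff1d(np.arange(50), anchors)
mae, agreement = mae_and_agreement(est[rest], target_true[rest])
```

On random synthetic data like this the estimates carry no real signal; the point of the sketch is only the bookkeeping of which points are estimated and how the two metrics are defined.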