BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, Daphne Ippolito

Abstract

Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.

Paper Structure (77 sections, 12 figures, 12 tables)

Figures (12)

  • Figure 1: BenchBrowser retrieves use-case-relevant items from 20+ benchmarks to help diagnose validity failures. For a "Writing Programming Functions" use-case: Python-skewed coverage (left; content validity gap) and unstable model ranks across subsets (right; low convergent validity) make conclusions unreliable. $m_{1}$ to $m_{k}$ are mid-sized decoder models. See Appendix \ref{app:motivation-experiment} for details.
  • Figure 2: BenchBrowser pipeline: a practitioner submits a use-case of interest to BenchBrowser; the use-case is rewritten into anchors for more diverse retrieval, which are then embedded and used to retrieve relevant examples from a suite of 20+ benchmarks. The practitioner can use the retrieved evidence in two ways (images show snapshots of BenchBrowser's website UI): (1) comparing how the use-case is represented along different facets, e.g., a user interested in assessing "reasoning skills" can observe the different ways (or lack thereof) in which reasoning competence is evaluated (linguistics fundamentals versus math problems); and (2) comparing the performance and stability of models evaluated on existing benchmarks (e.g., ARC AI2 Reasoning) for the task as well as on subsets of the retrieved examples.
  • Figure 3: Variation in the percentage of Relevant examples across different facets of the same broad use-case. For each skill family, we show the facet for which the highest and lowest number of examples are retrieved. Most capabilities show sharp disparities: certain facets are heavily over-represented in our database of 70k benchmark examples, while others are nearly absent from benchmarks.
  • Figure 4: Kendall’s $\tau$ correlations of model rankings on gold test sets vs. retrieved sets for known validation set use-cases with mean human relevance $\geq 0.5$. Wider bars correspond to greater rank divergence $\Delta = \tau_\text{gold} - \tau_\text{ret}$. $n$: number of retrieved examples.
  • Figure 5: Performance on the retrieved test set and on the gold test set (BBH and MMLU for the topics set) shows reasonable to high correlation.
  • ...and 7 more figures
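
Several captions above describe ranking stability in terms of Kendall's $\tau$. The following is a minimal sketch of how such a rank correlation between two evaluation subsets could be computed. It is not taken from the paper: the function name `kendall_tau`, the model names, and the accuracy values are all hypothetical, and the sketch uses the simple tau-a variant without tie correction.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts (no tie correction)."""
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both subsets order the pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies of four models on two subsets of retrieved items.
subset_1 = {"m1": 0.62, "m2": 0.55, "m3": 0.48, "m4": 0.40}
subset_2 = {"m1": 0.50, "m2": 0.58, "m3": 0.44, "m4": 0.47}

tau = kendall_tau(subset_1, subset_2)
print(f"Kendall's tau between subsets: {tau:.3f}")
```

A value near 1 would indicate that both subsets rank the models almost identically, while a value near 0 or below would suggest the unstable rankings that the abstract associates with low convergent validity.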