BenchBench: Benchmarking Automated Benchmark Generation

Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

Abstract

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets saturate rapidly, are vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, which introduces additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel, using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise. This yields designer-answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model-item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman rho ~0.37), that invalidity is negatively associated with discrimination (Pearson r ~0.62), and that the resulting designer-answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: https://github.com/koanatakiyo/BenchBench.
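
The scoring route in stage (iii) can be illustrated with a minimal sketch. This is not the BenchBench implementation: the function name `score_item`, the `kind` labels, the tolerance values, and the `judge` callable are all hypothetical, and symbolic verification is left out to keep the example dependency-free. It only shows the general idea of trying a deterministic verifier first and falling back to a rubric-guided judge otherwise.

```python
import math
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.strip().lower().split())


def score_item(
    answer: str,
    reference: str,
    kind: str,
    judge: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Return a score in [0, 1] for one model answer (hypothetical routing logic).

    kind == "exact":   normalized string equality
    kind == "numeric": float comparison within a small tolerance (assumed values)
    anything else:     deferred to a judge callable, standing in for an LLM judge
    """
    if kind == "exact":
        return 1.0 if normalize(answer) == normalize(reference) else 0.0
    if kind == "numeric":
        try:
            ok = math.isclose(float(answer), float(reference), rel_tol=1e-6, abs_tol=1e-9)
        except ValueError:
            # An unparsable answer counts as incorrect rather than raising.
            return 0.0
        return 1.0 if ok else 0.0
    if judge is None:
        raise ValueError(f"no deterministic verifier for kind={kind!r} and no judge supplied")
    return judge(answer, reference)
```

Keeping the judge as an injected callable means the deterministic paths can be tested without any model call, and only items that cannot be verified exactly or numerically incur judge cost.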

Paper Structure (64 sections, 4 equations, 12 figures, 10 tables)


Figures (12)

  • Figure 1: Benchmark creation is shifting from expert curation to human+LLM co-creation; BenchBench benchmarks the next step: auto-benchmarks. It makes generated suites measurable via domain cards, quota-controlled generation, and panel-based validation, yielding designer--answerer matrices for auditing validity, diagnostic utility, and interaction effects. Saturation trends adapted from hendrycks2021mmlu, epoch2025brief.
  • Figure 2: BenchBench knowledge-guided pipeline. Stage 1 extracts deterministic domain cards from seed benchmarks using text (and vision when applicable) oracles plus offline canonicalization. Stage 2 generates quota-controlled benchmark suites conditioned on the domain card and closes coverage gaps via deficit-driven top-ups. Stage 3 performs post-hoc cleaning, runs a multi-model answerer panel, routes scoring via exact/numeric/symbolic matching or LLM judging, and applies static/dynamic quality gates to produce a core designer$\times$answerer response matrix and downstream evaluation metrics.
  • Figure 3: Validity--discrimination tradeoff across designers (pooled across variants). Broken% is the non-core rate; MeanDiscr is averaged over hard-scored core items.
  • Figure 4: Ranking preservation across variants (Kendall's $\tau$), comparing designer-induced answerer rankings against a reference ranking. Higher values indicate greater preservation of established model order. Variant labels follow Table \ref{tab:benchbench_overview}.
  • Figure 5: Self/family bias in the designer--answerer matrix. Left: own-family advantage (accuracy on own-family items minus accuracy on other-family items, percentage points). Right: family-level accuracy matrix (darker = higher accuracy). Cells are aggregated over answerers; rows/columns grouped by model family.
  • ...and 7 more figures
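
The own-family advantage defined in the Figure 5 caption (accuracy on own-family items minus accuracy on other-family items, in percentage points) can be sketched as follows. The record layout `(designer, answerer, correct)`, the `family_of` mapping, and the choice to pool all responses per answerer family are assumptions made for illustration; the paper's exact aggregation may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def own_family_advantage(
    records: Iterable[Tuple[str, str, bool]],
    family_of: Dict[str, str],
) -> Dict[str, float]:
    """Own-family advantage per answerer family, in percentage points.

    records:   (designer, answerer, correct) triples, one per graded response
    family_of: maps each model name to its family label (hypothetical mapping)
    """
    own = defaultdict(list)    # responses to items designed within the answerer's family
    other = defaultdict(list)  # responses to items designed by any other family
    for designer, answerer, correct in records:
        family = family_of[answerer]
        bucket = own if family_of[designer] == family else other
        bucket[family].append(1.0 if correct else 0.0)

    # Only families with both kinds of responses have a defined advantage.
    return {
        family: 100.0 * (mean(own[family]) - mean(other[family]))
        for family in own
        if family in other
    }
```

A positive value means answerers in that family score higher on items written by their own family than on items written by others; families lacking either kind of response are omitted rather than reported as zero.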