YourBench: Easy Custom Evaluation Sets for Everyone

Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskia, Thomas Wolf, Gokhan Tur, Dilek Hakkani-Tür

TL;DR

This work tackles the inefficiencies of LLM evaluation by introducing YourBench, an open-source framework that automatically generates dynamic, domain-specific evaluation sets grounded in user-provided documents via Document-to-Evaluation Generation (D2EG). The pipeline preprocesses diverse documents, splits them into semantic chunks, and uses an ensemble of LLMs to generate QA pairs with verifiable citations, followed by quality filtering and deduplication. YourBench replicates MMLU subsets across seven domains while perfectly preserving the relative ranking of models (Spearman $\rho$ = 1.00), at a total inference cost under 15 USD, and the generated questions are harder than the originals. To mitigate contamination, the authors also introduce Tempora-0325, a dataset of documents published after March 2025. Across 26 SoTA models, approximately 85% of generated questions are judged valid, and the release includes 150k+ QA pairs for reproducible benchmarking. The open-source release enables on-demand, domain-tailored benchmarks, supporting more timely, relevant, and trustworthy LLM evaluation, with applications ranging from domain knowledge assessment to education and RAG data generation.

Abstract

Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, which is crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman $\rho$ = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
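
The rank-preservation claim can be made concrete with a small sketch of Spearman's rank correlation: rank the models on each benchmark, then take the Pearson correlation of the two rank vectors. The accuracy values below are hypothetical placeholders, not results from the paper, and the helper names are illustrative rather than part of the YourBench library.

```python
from statistics import mean


def average_ranks(values):
    """Assign ranks 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors of x and y.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies for four models (assumed values for illustration).
original = [0.82, 0.74, 0.69, 0.55]   # scores on the original benchmark
generated = [0.61, 0.50, 0.47, 0.30]  # lower scores, same model ordering

rho = spearman_rho(original, generated)
print(rho)
```

Because Spearman's coefficient depends only on the ordering, a generated benchmark can be uniformly harder (lower absolute scores) and still reach a coefficient of 1, which is the situation the abstract describes.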

Paper Structure

This paper contains 70 sections, 18 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: YourBench Automatically Generates Challenging MMLU Replicas. We evaluated YourBench's ability to replicate subsets of the MMLU benchmark across 7 diverse domains (Astronomy, Anatomy, etc.). Using only a few relevant Wikipedia pages per domain as input documents, YourBench automatically generated new multiple-choice question sets in the MMLU style. This process took <5 minutes and <$2 of inference cost per domain, requiring no human annotation. The resulting benchmarks (orange bars) demonstrate two key findings: (1) They perfectly preserve the relative performance rankings of various LLMs compared to the original MMLU (grey bars), confirming evaluation validity (Spearman $\rho$=1.00). (2) They consistently produce harder questions (lower absolute scores), yielding a more challenging, contamination-resistant evaluation derived directly from source material.
  • Figure 2: The Validity-Diversity Spectrum of Language Models. Comparing semantic diversity scores (left) and human-annotated validity scores (right) for questions generated by various models reveals a trade-off. Models like o3 mini excel in validity (generating consistently answerable, clear questions) but exhibit low diversity, often focusing on routine or algorithmic queries, while models like Qwen2.5 32B achieve high diversity at the cost of slightly lower average validity. A few models, like DeepSeek V3, strike a strong balance, scoring well on both dimensions.
  • Figure 3: Evaluation of citation grounding performance. (a) Compares aggregate citation scores across various models. (b) Illustrates the Pareto frontier for inference cost (log scale) versus citation score, highlighting efficiency trade-offs. Full model list in the appendix.
  • Figure 4: Comparison of generated MMLU style questions in various domains.
  • Figure 5: Overview of the YourBench Framework: A dynamic pipeline starting from diverse documents, through preprocessing (ingestion, chunking, summarization), LLM-driven question generation (following D2EG principles), quality filtering (citation validation, deduplication), to automated evaluation using an LLM judge ensemble.
  • ...and 10 more figures