MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks

Jaime Raldua Veuthey, Zainab Ali Majid, Suhas Hariharan, Jacob Haimes

TL;DR

This paper introduces MEQA, a meta-evaluation framework that divides eight criteria into 44 sub-criteria to rate QA benchmarks on a scale from $1$ to $5$, addressing the need for standardized, reproducible assessments of benchmark quality. It demonstrates MEQA on cybersecurity QA benchmarks by combining three human evaluators with an LLM evaluator (GPT-4o), showing that automated scoring can closely track human judgments and enable scalable meta-analysis. The results indicate strong performance in reproducibility and comparability across benchmarks, but notable weaknesses in prompt robustness and reliability, with high inter-criterion variability. The work argues for using meta-evaluations to guide benchmark development and encourages extending MEQA to additional domains to support robust, trustworthy LLM evaluation practices and more effective gap analysis.

Abstract

As Large Language Models (LLMs) advance, their potential for widespread societal impact grows simultaneously. Hence, rigorous LLM evaluations are both a technical necessity and social imperative. While numerous evaluation benchmarks have been developed, there remains a critical gap in meta-evaluation: effectively assessing benchmarks' quality. We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks, to provide standardized assessments, quantifiable scores, and enable meaningful intra-benchmark comparisons. We demonstrate this approach on cybersecurity benchmarks, using human and LLM evaluators, highlighting the benchmarks' strengths and weaknesses. We motivate our choice of test domain by AI models' dual nature as powerful defensive tools and security threats.

Paper Structure

This paper contains 19 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Scores of cybersecurity benchmarks per criterion. N/A indicates inapplicable criteria (e.g. SECURE uses pre-defined correct answers; evaluator design does not apply).
  • Figure 2: Scores of cybersecurity benchmarks across memorization robustness sub-criteria.
  • Figure 3: Scores of cybersecurity benchmarks across prompt robustness sub-criteria.
  • Figure 4: Scores of cybersecurity benchmarks across evaluation design sub-criteria.
  • Figure 5: Scores of cybersecurity benchmarks across evaluator design sub-criteria.
  • ...and 4 more figures