SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das

TL;DR

This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations.

Abstract

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
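The abstract reports an F1-score but does not define it. The sketch below assumes the metric follows the original SimpleQA convention: the harmonic mean of the overall correct rate and the correct rate among attempted questions. The function name `simpleqa_f1` and the grade labels `CORRECT`, `INCORRECT`, and `NOT_ATTEMPTED` are illustrative assumptions, not taken from the released evaluation code.

```python
from collections import Counter


def simpleqa_f1(grades):
    """Harmonic mean of overall-correct and correct-given-attempted rates.

    `grades` is a list of autorater labels, one per prompt. The label
    strings used here are assumed for illustration.
    """
    counts = Counter(grades)
    total = len(grades)
    correct = counts["CORRECT"]
    attempted = correct + counts["INCORRECT"]  # NOT_ATTEMPTED is excluded

    # With no correct answers, both rates are zero (or undefined), so F1 is 0.
    if total == 0 or correct == 0:
        return 0.0

    overall_correct = correct / total
    correct_given_attempted = correct / attempted
    return (
        2 * overall_correct * correct_given_attempted
        / (overall_correct + correct_given_attempted)
    )


# Toy example: 6 correct, 2 incorrect, 2 not attempted out of 10 prompts.
toy_grades = ["CORRECT"] * 6 + ["INCORRECT"] * 2 + ["NOT_ATTEMPTED"] * 2
print(round(simpleqa_f1(toy_grades), 4))  # 0.6667
```

Under this definition, a model that declines to answer when unsure raises its correct-given-attempted rate but lowers its overall correct rate, so the score rewards neither blind guessing nor blanket abstention.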

Paper Structure

This paper contains 20 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Distributions of answer types as a percentage of the total number of data points in SimpleQA Verified and SimpleQA. The answer type classification was initially performed by Wei et al. (2024).
  • Figure 2: Distributions of question topics as a percentage of the total number of data points in SimpleQA Verified and SimpleQA. The topic classification was initially performed by Wei et al. (2024).