FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

Gabriel Recchia, Chatrik Singh Mangat, Issac Li, Gayatri Krishnakumar

TL;DR

FindTheFlaws presents five expert-annotated datasets across medicine, mathematics, science, coding, and Lojban to study scalable oversight of AI, providing long-form correct and flawed solutions with error annotations. The work evaluates frontier models on two tasks: judging solution correctness (match) and identifying the specific errors (grading) across domains, revealing domain-specific strengths and saturation patterns. Notably, some expert baselines exceed top models on certain tasks, and CELS Lojban remains unsaturated, highlighting the need for robust verification benchmarks. The datasets support analysis of debate, critique, and prover-verifier approaches and aim to guide the development of scalable oversight as AI capabilities advance.

Abstract

As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable 'prover' model generates solutions that must be verifiable by a less capable 'verifier'. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few such datasets are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making those combinations particularly useful for scalable oversight experiments.

Paper Structure

This paper contains 38 sections, 3 figures, 13 tables.

Figures (3)

  • Figure 1: Performance of each model, as well as expert baselines, on match and grading metrics for Adversarial MedQA, Modified TheoremQA, and GPQA Diamond Plus. Expert baselines for Adversarial MedQA represent the performance of a human clinician, while baselines for the other two datasets (available for error grading only) represent agreement between o3-mini and the solution authors about the location of the first error when o3-mini is provided with the labeled correct and flawed solutions developed by the solution authors (Appendix \ref{subsubsec:baselines_theoremqa_gpqa}). 95% confidence intervals were calculated using a cluster-based block bootstrap approach.
  • Figure 2: Performance of each model, as well as human expert baselines (Appendix \ref{subsubsec:baselines_python}), on the match task for Python650 and Meta-Python650, and on the error grading task for two subsets of Meta-Python650. 95% confidence intervals were calculated using a cluster-based block bootstrap approach.
  • Figure 3: Performance of each model, as well as expert baselines, on match and error grading metrics for CELS Surgery, Law, and Lojban. Baselines represent the performance of a single human expert for CELS Law and CELS Lojban, and of a majority vote of three clinicians for CELS Surgery (Appendix \ref{subsubsec:baselines_cels}). 95% confidence intervals were calculated using a cluster-based block bootstrap approach.
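
The figure captions state that the 95% confidence intervals come from a cluster-based block bootstrap, but this excerpt does not describe the procedure. The sketch below shows one common form of that idea, under assumptions that are not taken from the paper: each cluster is taken to be a question whose several scored solutions are correlated, the statistic is mean accuracy, and the interval is a simple percentile interval. The function name `cluster_bootstrap_ci` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean score, resampling whole clusters.

    records: iterable of (cluster_id, score) pairs, with score in [0, 1].
    Assumption (not from the paper): a cluster is one question, and all
    scored solutions for that question are resampled together as a block.
    """
    clusters = defaultdict(list)
    for cluster_id, score in records:
        clusters[cluster_id].append(score)
    groups = list(clusters.values())
    if not groups:
        raise ValueError("records must not be empty")

    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many clusters as the data has, with replacement,
        # keeping every item of a drawn cluster together.
        sample = rng.choices(groups, k=len(groups))
        total = sum(sum(group) for group in sample)
        count = sum(len(group) for group in sample)
        stats.append(total / count)

    stats.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lo_idx], stats[hi_idx]


# Toy data: three questions, two scored solutions each (1 = judged correctly).
records = [("q1", 1), ("q1", 0), ("q2", 1), ("q2", 1), ("q3", 0), ("q3", 1)]
low, high = cluster_bootstrap_ci(records)
print(f"95% CI for accuracy: [{low:.2f}, {high:.2f}]")
```

Resampling whole clusters rather than individual items keeps correlated items together, which typically widens the interval compared with an item-level bootstrap when scores within a question move together.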