FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers

Sarina Xi, Vishisht Rao, Justin Payan, Nihar B. Shah

TL;DR

FLAWS is an automated benchmark that inserts claim-invalidating errors into ICML 2025 papers to evaluate how well LLMs identify and localize errors. It uses a two-stage error insertion pipeline and a dual evaluation metric (word-level Levenshtein similarity and LLM judging), whose reliability is validated against human annotators. Of the five frontier LLMs evaluated, GPT-5 achieves the best top-10 accuracy at 39.1%, and no model reaches 50%, highlighting both the difficulty of the task and the potential of AI-assisted error localization in scientific writing. The work also discusses scalability, data contamination, and ways to extend the benchmark to other domains and to newer models.

Abstract

The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper, avoiding artifacts that would make identification trivial, and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy when k=10, where k is the number of top-ranked error text candidates generated by the LLM.
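
To make the top-k identification metric more concrete, the following is a minimal sketch of how a word-level Levenshtein similarity could be used to decide whether any of a model's top-k candidate spans matches the inserted error. It is an illustration under stated assumptions, not the authors' implementation: the normalization (1 minus distance over the longer length), the lowercase whitespace tokenization, the 0.5 threshold, and all function names are hypothetical, and the LLM-judging half of the dual metric is not modeled here.

```python
from typing import List, Sequence


def word_levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance over word tokens (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_similarity(x: str, y: str) -> float:
    """Assumed normalization: 1 - distance / max(length), giving a value in [0, 1]."""
    a, b = x.lower().split(), y.lower().split()
    if not a and not b:
        return 1.0
    return 1.0 - word_levenshtein(a, b) / max(len(a), len(b))


def hit_at_k(candidates: List[str], inserted_error: str,
             k: int = 10, threshold: float = 0.5) -> bool:
    """True if any of the first k ranked candidates is close to the inserted error.

    The threshold is a hypothetical value chosen only for illustration.
    """
    return any(word_similarity(c, inserted_error) >= threshold
               for c in candidates[:k])


def accuracy_at_k(predictions: List[List[str]], gold: List[str],
                  k: int = 10, threshold: float = 0.5) -> float:
    """Fraction of paper-error pairs with at least one matching candidate in the top k."""
    hits = [hit_at_k(p, g, k, threshold) for p, g in zip(predictions, gold)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    # Toy data, invented purely to exercise the functions.
    preds = [["unrelated text here", "the model diverges quickly"],
             ["nothing relevant"]]
    gold = ["the model diverges", "loss decreases monotonically"]
    print(accuracy_at_k(preds, gold, k=1))
    print(accuracy_at_k(preds, gold, k=2))
```

In this sketch, raising k can only add hits, which is consistent with reporting accuracy as a function of k; the exact matching rule used by FLAWS may differ.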

Paper Structure

This paper contains 47 sections, 4 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: High-level overview of error insertion as well as error identification and evaluation framework.
  • Figure 2: Predicted identification accuracy, $\Pr(y_k=1)$, across top-$k$ error candidates for each identification model. Shaded regions show 95% confidence intervals.