To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

Federico Bianchi, Yongchan Kwon, Zachary Izzo, Linjun Zhang, James Zou

TL;DR

This work tackles the problem of objective mistakes in published AI papers by introducing a GPT-5–based Paper Correctness Checker that automatically detects verifiable errors in formulas, derivations, figures, and related content. Through large-scale sampling of ICLR, NeurIPS, and TMLR papers, the study shows a non-negligible and increasing error burden, with math/formula mistakes dominating and a substantial share of papers containing potentially substantive issues. The system achieves a precision of 83.2% in identifying real mistakes and demonstrates recall around 60% on injected errors, while also providing concrete fixes for the majority of correctable mistakes (75.8% of proposed fixes validated). The results support a hybrid workflow where LLM-based checking helps human reviewers focus on interpretation and significance, offering a practical, low-cost tool to improve reproducibility and reliability in the rapidly expanding AI literature.

Abstract

How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
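
The two headline figures in the abstract follow from simple arithmetic, which the sketch below reproduces. The function names `precision` and `relative_increase` are illustrative and not taken from the paper. The counts and per-paper averages are the ones quoted above. Only the NeurIPS increase (55.3%) is stated in the abstract; the ICLR and TMLR percentages printed here are derived from the quoted averages and are not reported by the authors.

```python
def precision(confirmed: int, flagged: int) -> float:
    """Fraction of flagged issues that human reviewers confirmed as real mistakes."""
    return confirmed / flagged


def relative_increase(old: float, new: float) -> float:
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Counts quoted in the abstract: 263 confirmed out of 316 flagged.
print(f"precision: {precision(263, 316):.1%}")

# Average mistakes per paper (earliest year, latest year) quoted in the abstract.
averages = {
    "NeurIPS": (3.8, 5.9),
    "ICLR": (4.1, 5.2),
    "TMLR": (5.0, 5.5),
}
for venue, (old, new) in averages.items():
    # Only the NeurIPS value is stated in the abstract; the others are derived here.
    print(f"{venue}: {relative_increase(old, new):.1%} increase")
```

Rounded to one decimal place, the first two printed values match the 83.2% precision and the 55.3% NeurIPS increase given in the abstract.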

Paper Structure

This paper contains 14 sections, 1 theorem, 18 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

The map $\mathcal{G} \mapsto x_\mathcal{G}$ defined by the equations above is injective if $F^\ell$, $G^\ell$, and $R$ are injective as multiset functions.

Figures (4)

  • Figure 1: AI Checker detected mistakes for papers published in NeurIPS (left column), ICLR (middle), and TMLR (right) across years. The top row shows the average number of detected mistakes per paper, while the bottom row shows the percentage of papers with $\geq 1$ potentially substantive mistake. Error bars represent the standard errors.
  • Figure 2: Percentage of mistakes by category for NeurIPS (left), ICLR (middle), and TMLR (right). The distributions of mistake types are similar across different venues, with math and formula mistakes being the most prominent category in published AI papers. We provide several representative examples the AI Checker identified in Section \ref{sec:example}.
  • Figure 3: AI Checker detected mistakes in published NeurIPS papers when reviewing only the first 10 pages of each paper to control for paper length. The average number of detected mistakes per paper (left) and the percentage of papers with $\geq 1$ potentially substantive mistake (right). Error bars represent the standard errors. Even after accounting for paper length, the percentage of papers with at least one potentially substantive mistake has increased over time.
  • Figure 4: Human-verified evaluation of our AI Correctness Checker performance. Contingency table of the 263 mistakes identified by our AI Checker and confirmed by humans (left) and recall across mistake categories on the 90 injected mistakes (right). Overall, the AI Checker shows relatively high precision in identifying actual mistakes: 263 of 316 flagged issues are genuine mistakes. Detecting all the mistakes in a paper is more challenging. Therefore, the number of mistakes identified by the AI Checker can be interpreted as a conservative lower estimate, as unflagged mistakes may still remain.
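
The captions above mention standard errors and a precision estimated from 316 human-reviewed flags. As a rough illustration of the sampling uncertainty on such a proportion, the sketch below computes a binomial standard error. This is an assumption made here for illustration: the summary does not say how the authors computed their error bars, and the helper name `proportion_standard_error` is hypothetical.

```python
import math


def proportion_standard_error(successes: int, trials: int) -> float:
    """Binomial standard error of a sample proportion: sqrt(p * (1 - p) / n).

    Assumes independent, identically distributed trials, which is a
    simplification and may not hold for flags drawn from the same paper.
    """
    p = successes / trials
    return math.sqrt(p * (1.0 - p) / trials)


# Counts from the Figure 4 caption: 263 confirmed mistakes out of 316 flagged issues.
p_hat = 263 / 316
se = proportion_standard_error(263, 316)
print(f"precision = {p_hat:.3f}, binomial standard error = {se:.3f}")
```

Under this simplified model the standard error is roughly two percentage points, so the reported precision is reasonably well determined by a sample of this size. The same helper could be applied to the recall on the 90 injected mistakes once the number of detected injections is known; that count is not given in this summary.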

Theorems & Definitions (5)

  • Theorem 1
  • proof
  • Example 1: LLM confused by non-standard notation
  • Example 2: OCR in Math/Formula
  • Example 3: OCR in an Algorithm box