On Finding Inconsistencies in Documents

Charles J. Lovering, Seth Ebner, Brandon Smock, Michael Krumdick, Saad Rabbani, Ahmed Muhammad, Varshini Reddy, Chris Tanner

TL;DR

The paper introduces FIND, a benchmark for detecting internal inconsistencies in long, technical documents, built from errors that domain experts inserted into finance and cs.CL content. It formalizes a three-part evaluation (evidence, description, and full task) and reports that GPT-5 and other large models achieve around 60–64% recall on inserted inconsistencies, with GPT-5 the best performer at 64%. The models also surface previously undiscovered issues in the original documents: on 50 arXiv papers, 136 of GPT-5's 196 suggestions were judged to be legitimate inconsistencies. The dataset combines diverse sources (BLS, PRE, SEC, EMM, PG, cs.CL, MFR) and emphasizes long-context challenges; both document length and inconsistency type influence performance. Overall, the work shows the potential and the current limits of automated inconsistency detection in real-world, high-stakes documents, and points to directions for improving reliability and practical deployment in auditing workflows.

Abstract

Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.
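The figures in the abstract can be turned into rates with a few lines of arithmetic. The sketch below is illustrative only: the helper `fraction` and the variable names are hypothetical and do not come from the paper; the numbers (136, 196, and 64%) are the ones quoted above. The miss rate shown is for gpt-5 alone; scores for other models are not given in this excerpt.

```python
def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Figures quoted in the abstract.
judged_legitimate = 136  # gpt-5 suggestions judged to be real inconsistencies
total_suggestions = 196  # all gpt-5 suggestions on the 50 arXiv papers
best_recall = 0.64       # share of inserted inconsistencies gpt-5 recovered

# Share of suggestions on the original papers that were judged legitimate.
precision = fraction(judged_legitimate, total_suggestions)

# Share of inserted inconsistencies that gpt-5 did not recover.
miss_rate = 1.0 - best_recall

print(f"precision on original arXiv papers: {precision:.1%}")
print(f"gpt-5 miss rate on FIND: {miss_rate:.0%}")
```

Under these numbers, roughly 69% of the suggestions on the original arXiv papers were judged legitimate, and gpt-5 left 36% of the inserted inconsistencies unrecovered.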

Paper Structure

This paper contains 69 sections, 1 equation, 32 figures, and 16 tables.

Figures (32)

  • Figure 1: The task is to find inconsistencies within an input document. For each inconsistency, the model: (a) identifies the evidence spans that constitute the inconsistency and (b) generates a corresponding description.
  • Figure 2: Numeric inconsistency from EMM.
  • Figure 3: Non-numeric inconsistency from SEC.
  • Figure 4: Structural inconsistency from BLS.
  • Figure 5: Absolute Evidence Locations and Test Document Lengths. Each row corresponds to a document, with each row ending at a gray square indicating document length. Purple squares mark evidence locations. Black squares indicate that the document exceeds the displayed length. (For BLS and PRE the axis ends at $5\times10^4$.)
  • ...and 27 more figures