Fast and Accurate Factual Inconsistency Detection Over Long Documents

Barrett Martin Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang

TL;DR

The paper addresses factual inconsistency in long-form generations by introducing SCALE, a chunking-based, NLI-driven detector that conditions on large source chunks and uses a BST-based retrieval strategy for explanations. SCALE achieves state-of-the-art results on the TRUE benchmark and demonstrates strong calibration, efficiency, and explainability on a newly released ScreenEval long-dialogue dataset. The approach balances accuracy, speed, and interpretability, enabling practical online deployment for long documents. The work also contributes ScreenEval as a long-form dialogue inconsistency dataset and provides public code and data to foster reproducibility and further research.

Abstract

Generative AI models exhibit remarkable potential; however, hallucinations across various tasks present a significant challenge, particularly for longer inputs that current approaches struggle to address effectively. We introduce SCALE (Source Chunking Approach for Large-scale inconsistency Evaluation), a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy. Specifically, SCALE is a Natural Language Inference (NLI) based model that uses large text chunks to condition over long texts. This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs. Additionally, we leverage the chunking mechanism and employ a novel algorithm to explain SCALE's decisions through relevant source sentence retrieval. Our evaluations reveal that SCALE outperforms existing methods on both standard benchmarks and a new long-form dialogue dataset ScreenEval we constructed. Moreover, SCALE surpasses competitive systems in efficiency and model explanation evaluations. We have released our code and data publicly to GitHub.
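
The chunk-and-score idea described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions for illustration and are not stated in the text above: chunks are formed from a fixed number of consecutive sentences, the entailment probability is a softmax over a "yes" and a "no" logit, and the per-chunk scores are combined by taking the maximum. `toy_logits` is a hypothetical word-overlap stand-in for the NLI model and is not part of the released code.

```python
import math
from typing import Callable, List, Tuple


def chunk_sentences(sentences: List[str], chunk_size: int) -> List[str]:
    """Group consecutive source sentences into chunks of up to chunk_size sentences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


def entailment_prob(yes_logit: float, no_logit: float) -> float:
    """Softmax restricted to the two logits (assumed scoring rule)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def chunked_score(
    sentences: List[str],
    hypothesis: str,
    chunk_size: int,
    logits_fn: Callable[[str, str], Tuple[float, float]],
) -> float:
    """Score the hypothesis against every chunk and keep the best score.

    Taking the maximum over chunks is an assumption of this sketch.
    """
    chunks = chunk_sentences(sentences, chunk_size)
    if not chunks:
        raise ValueError("source document is empty")
    return max(entailment_prob(*logits_fn(chunk, hypothesis)) for chunk in chunks)


def toy_logits(premise: str, hypothesis: str) -> Tuple[float, float]:
    """Hypothetical stand-in for an NLI model: word overlap as the 'yes' logit."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(p & h) / max(len(h), 1)
    return overlap, 1.0 - overlap


source = [
    "Alice went to Paris.",
    "Bob stayed home.",
    "It rained all day.",
    "Nobody called.",
]
score = chunked_score(source, "Bob stayed home.", chunk_size=2, logits_fn=toy_logits)
```

In a real setting `logits_fn` would wrap a model call, and the chunk size would be chosen so each premise fits the model's context window.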

Paper Structure (33 sections, 5 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Chunking mechanism for SCALE to produce a score given a source document and generated text. The source document is broken into chunks (represented by dashed lines), and each chunk is fed into a prompt as the premise. The highlighted generated text is fed into all prompts as the hypothesis. Each prompt is then run through Flan-T5, and the resulting logits are used to compute the entailment score.
  • Figure 2: Visualization of SCALE's in-context retrieval, which uses chunks to find the most relevant source utterance for a given sentence. Each chunk is scored by SCALE, as shown by the gray boxes.
  • Figure 3: Calibration curves on the PAWS benchmark
  • Figure 4: Effect of different chunk sizes on calibration performance on the ScreenEval dataset.
  • Figure 5: Effect of different chunk sizes on $\text{SCALE}_{\text{large}}$ performance and time on the ScreenEval dataset. The ROC_AUC score is multiplied by 100 for readability.
  • ...and 2 more figures
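
The chunk-based retrieval described in the Figure 2 caption can be sketched as a binary search over utterances. This is only an illustration under stated assumptions: the span is halved at each step, each half is scored as a premise against the sentence, and the search descends into the higher-scoring half, with ties going to the left half. The captions above do not spell out these details, and `toy_overlap_score` is a hypothetical stand-in for the SCALE scorer.

```python
from typing import Callable, List


def retrieve_utterance(
    utterances: List[str],
    sentence: str,
    score_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the utterance reached by repeatedly keeping the
    higher-scoring half of the current span (assumed search procedure)."""
    if not utterances:
        raise ValueError("no utterances to search")
    lo, hi = 0, len(utterances)  # current span is utterances[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = score_fn(" ".join(utterances[lo:mid]), sentence)
        right = score_fn(" ".join(utterances[mid:hi]), sentence)
        if left >= right:
            hi = mid  # keep the left half (ties go left)
        else:
            lo = mid  # keep the right half
    return lo


def toy_overlap_score(premise: str, hypothesis: str) -> float:
    """Hypothetical scorer: fraction of hypothesis words found in the premise."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)


dialogue = [
    "Alice went to Paris.",
    "Bob stayed home.",
    "It rained all day.",
    "Nobody called.",
    "The end.",
]
best = retrieve_utterance(dialogue, "It rained all day.", toy_overlap_score)
```

Under these assumptions the search makes about 2·log2(n) scorer calls for n utterances, rather than scoring every utterance individually.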