Fast and Accurate Factual Inconsistency Detection Over Long Documents
Barrett Martin Lattimer, Patrick Chen, Xinyuan Zhang, Yi Yang
TL;DR
The paper addresses factual inconsistency in long-form generations by introducing SCALE, a chunking-based, NLI-driven detector that conditions on large source chunks and uses a BST-based retrieval strategy for explanations. SCALE achieves state-of-the-art results on the TRUE benchmark and demonstrates strong calibration, efficiency, and explainability on a newly released ScreenEval long-dialogue dataset. The approach balances accuracy, speed, and interpretability, enabling practical online deployment for long documents. The work also contributes ScreenEval as a long-form dialogue inconsistency dataset and provides public code and data to foster reproducibility and further research.
Abstract
Generative AI models exhibit remarkable potential; however, hallucinations across various tasks present a significant challenge, particularly for longer inputs that current approaches struggle to address effectively. We introduce SCALE (Source Chunking Approach for Large-scale inconsistency Evaluation), a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy. Specifically, SCALE is a Natural Language Inference (NLI) based model that uses large text chunks to condition over long texts. This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs. Additionally, we leverage the chunking mechanism and employ a novel algorithm to explain SCALE's decisions through relevant source sentence retrieval. Our evaluations reveal that SCALE outperforms existing methods on both standard benchmarks and ScreenEval, a new long-form dialogue dataset that we constructed. Moreover, SCALE surpasses competitive systems in efficiency and model explanation evaluations. We have released our code and data publicly on GitHub.
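The chunking idea described in the abstract can be illustrated with a minimal sketch. This is not the released SCALE implementation: the function names, the sentence-count chunk size, the max-over-chunks aggregation, and the token-overlap scorer (a stand-in for a real NLI model) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def chunk_sentences(sentences: List[str], chunk_size: int) -> List[str]:
    """Group consecutive source sentences into chunks of up to chunk_size sentences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Hypothetical placeholder for an NLI model.

    Returns the fraction of hypothesis tokens that also appear in the premise.
    A real system would return an entailment probability from a trained model.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    matched = sum(token in premise_tokens for token in hypothesis_tokens)
    return matched / len(hypothesis_tokens)


def score_claim(
    sentences: List[str],
    claim: str,
    chunk_size: int,
    scorer: Callable[[str, str], float] = toy_entailment_score,
) -> Tuple[float, int]:
    """Score a claim against every source chunk.

    Returns the best score and the index of the chunk that produced it
    (assumed aggregation: maximum over chunks).
    """
    chunks = chunk_sentences(sentences, chunk_size)
    if not chunks:
        raise ValueError("source document has no sentences")
    scores = [scorer(chunk, claim) for chunk in chunks]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return scores[best_index], best_index


source = [
    "alice opened the door",
    "bob left the room",
    "carol made tea",
    "dave read a book",
]
best_score, best_chunk = score_claim(source, "carol made tea", chunk_size=2)
```

In this sketch the index of the best-scoring chunk doubles as a coarse pointer to supporting source text; narrowing that chunk down to individual sentences is left out, since the abstract does not specify how the retrieval algorithm works.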
