Zero-shot Factual Consistency Evaluation Across Domains
Raunak Agarwal
TL;DR
The paper tackles factual inconsistency in conditional text generation by framing source-target factual alignment as a binary task and training on a broad pool of NLI-like datasets, which unifies evaluation across domains. It finetunes small, long-context models (e.g., FLAN-T5-Base/Large) so that factuality can be assessed in a single pass over contexts of up to 2048 tokens, and evaluates them on a heterogeneous 22-dataset benchmark against 8 baselines. Results show state-of-the-art cross-domain factual consistency performance, with strong generalization but remaining challenges in dialogue summarization and citation-grounded tasks. The work emphasizes efficiency, cross-domain generalization, and the need for updated, multilingual benchmarks to keep fact-checking reliable in an era of widespread LLM usage.
Abstract
This work addresses the challenge of factual consistency in text generation systems. We unify the tasks of Natural Language Inference, Summarization Evaluation, Factuality Verification and Factual Consistency Evaluation to train models capable of evaluating the factual consistency of source-target pairs across diverse domains. We rigorously evaluate these models against eight baselines on a comprehensive benchmark suite comprising 22 datasets that span various tasks, domains, and document lengths. Results demonstrate that our method achieves state-of-the-art performance on this heterogeneous benchmark while addressing efficiency concerns and attaining cross-domain generalization.
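
To make the binary source-target framing concrete, here is a minimal sketch of how a pair might be turned into a single model input and how a generated answer might be mapped back to a label. Everything in it is an illustrative assumption rather than the paper's implementation: the prompt template, the names `Pair`, `build_prompt`, `approx_token_count`, `fits_single_pass`, and `parse_label`, and the whitespace-based token count, which only roughly approximates what a real subword tokenizer would report. The only value taken from the text is the 2048-token context limit.

```python
from dataclasses import dataclass

# Context limit quoted in the TL;DR above.
MAX_CONTEXT_TOKENS = 2048


@dataclass
class Pair:
    """A source document and a target text whose consistency is in question."""
    source: str
    target: str


def build_prompt(pair: Pair) -> str:
    # Hypothetical NLI-style template; the actual prompt used by the
    # authors is not specified in this section.
    return (
        f"premise: {pair.source}\n"
        f"hypothesis: {pair.target}\n"
        "Is the hypothesis supported by the premise? Answer yes or no."
    )


def approx_token_count(text: str) -> int:
    # Crude proxy (assumption): count whitespace-separated words.
    # A real subword tokenizer would usually produce a larger count.
    return len(text.split())


def fits_single_pass(pair: Pair, limit: int = MAX_CONTEXT_TOKENS) -> bool:
    # True when the whole prompt could be scored in one forward pass
    # without chunking the source document.
    return approx_token_count(build_prompt(pair)) <= limit


def parse_label(generated: str) -> int:
    # Map the model's free-text answer to a binary label:
    # 1 = factually consistent, 0 = inconsistent.
    return 1 if generated.strip().lower().startswith("yes") else 0
```

In use, each source-target pair would be passed through `build_prompt`, sent to a finetuned sequence-to-sequence model, and the decoded answer passed to `parse_label`. Pairs for which `fits_single_pass` returns false would need truncation or chunking, which is the case the long-context setup is meant to make rare.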
