SIFiD: Reassess Summary Factual Inconsistency Detection with LLM
Jiuding Yang, Hui Liu, Weidong Guo, Zhuwei Rao, Yu Xu, Di Niu
TL;DR
The paper reevaluates summary factual inconsistency detection with the latest GPT models on SummaC and introduces SIFiD, a method that filters the document down to its most relevant sentences using either entailment or semantic similarity before performing inconsistency detection. GPT-4 significantly improves performance over GPT-3.5, and SIFiD further enhances detection by focusing on relevant content, achieving higher accuracy with reduced input size. Benchmark-specific prompt templates are shown to be crucial for capitalizing on LLM capabilities. The work demonstrates meaningful improvements in factual consistency detection and provides open-source code to foster reproducibility and further research, with practical implications for efficient and scalable evaluation in summarization systems.
Abstract
Ensuring factual consistency between the summary and the original document is paramount in summarization tasks. Consequently, considerable effort has been dedicated to detecting inconsistencies. With the advent of Large Language Models (LLMs), recent studies have begun to leverage their advanced language understanding capabilities for inconsistency detection. However, early attempts have shown that LLMs underperform traditional models due to their limited ability to follow instructions and the absence of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performance of GPT-3.5 and GPT-4. To advance research in LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.
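The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain bag-of-words cosine similarity in place of a learned sentence-embedding or natural language inference scorer, and the function names (`split_sentences`, `filter_document`) and the threshold value are hypothetical choices made only for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    # Lowercased token counts; a stand-in for a real sentence embedding.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_document(document, summary, threshold=0.3):
    # Keep a document sentence if it is similar enough to at least one
    # summary sentence. The threshold is an assumed, untuned value.
    summary_vectors = [bag_of_words(s) for s in split_sentences(summary)]
    kept = []
    for sentence in split_sentences(document):
        vector = bag_of_words(sentence)
        score = max((cosine(vector, sv) for sv in summary_vectors), default=0.0)
        if score >= threshold:
            kept.append(sentence)
    return " ".join(kept)


document = "The cat sat on the mat. Stocks fell sharply on Monday. The dog barked."
summary = "The cat sat on a mat."
filtered = filter_document(document, summary)
```

In a pipeline of this kind, the filtered text rather than the full document would be passed to the LLM together with the summary, which is how the input size shrinks. Replacing `cosine` over word counts with an embedding model or an entailment scorer would bring the sketch closer to the two variants the abstract names.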
