Evaluating LLMs' Assessment of Mixed-Context Hallucination Through the Lens of Summarization
Siya Qi, Rui Cao, Yulan He, Zheng Yuan
TL;DR
This paper tackles the challenge of evaluating mixed-context hallucinations in summarization using LLMs as judges. It introduces FHSumBench, a benchmark built with an automated pipeline that injects factual or non-factual knowledge into correct summaries to create a balanced, scalable dataset for assessing faithfulness and factuality. The study systematically compares direct-generation and retrieval-based evaluators across models and prompting strategies, revealing that external knowledge retrieval and prompt design can significantly improve detection, while scaling alone offers limited gains and intrinsic knowledge bias remains a bottleneck. The findings underscore the importance of effective knowledge integration and retrieval strategies for robust LLM-based evaluation, with practical implications for building reliable self-evaluating systems and benchmarks in NLP.
Abstract
With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, which remain inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs' capability in detecting mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct-generation and retrieval-based models of varying scales, we make three main observations: (1) LLMs' intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) these biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; (3) the fundamental challenge lies in effective knowledge utilization, that is, balancing LLMs' intrinsic knowledge against external context for accurate mixed-context hallucination evaluation.
