Towards Unification of Hallucination Detection and Fact Verification for Large Language Models
Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu, Ziyi Ye, Qingyao Ai, Yiqun Liu
TL;DR
This work addresses the fragmentation between Fact Verification (FV) and Hallucination Detection (HD) in LLMs by introducing UniFact, a dynamic, unified evaluation framework that generates model outputs on the fly and labels their factuality with a reference-based judge. Large-scale experiments across diverse models and datasets show that FV and HD are complementary rather than redundant, and that simple hybrids combining both signals consistently outperform either paradigm alone. The paper also analyzes why FV and HD diverged, demonstrating that FV depends on retrieval and that HD is sensitive to semantic variability, and presents two practical hybrid designs, Score-Level Fusion and the Evidence-Aware Pipeline, that achieve new state-of-the-art performance. Overall, UniFact promotes a shift toward integrated factuality assessment, enabling robust, model-agnostic evaluation and offering guidance for building more trustworthy AI systems.
Abstract
Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder the adoption of LLMs in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, with distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) no paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all the code, data, and baseline implementations at: https://github.com/oneal2000/UniFact/
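To make the idea of a hybrid that integrates both signals concrete, the following is a minimal illustrative sketch of one way to combine per-instance HD and FV scores. It is not the authors' implementation: the abstract does not specify how the hybrids are built, so the min-max normalization, the weighted average, and the names `min_max_normalize`, `fuse_scores`, and `weight` are all assumptions made for illustration. Both score lists are assumed to be oriented so that a higher value means "more likely non-factual".

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so two detectors become comparable.

    Hypothetical helper: the source does not state a normalization scheme.
    """
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no ranking information, use a neutral value.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    hd_scores: Sequence[float],
    fv_scores: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Weighted average of normalized HD and FV scores, one per instance.

    `weight` is the share given to the HD signal (an assumed parameter).
    """
    if len(hd_scores) != len(fv_scores):
        raise ValueError("hd_scores and fv_scores must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    hd = min_max_normalize(hd_scores)
    fv = min_max_normalize(fv_scores)
    return [weight * h + (1.0 - weight) * f for h, f in zip(hd, fv)]


if __name__ == "__main__":
    # Toy scores for three generated answers (made-up numbers).
    hd = [0.2, 0.9, 0.4]
    fv = [0.1, 0.3, 0.8]
    print(fuse_scores(hd, fv, weight=0.5))
```

In practice the weight would be tuned on held-out data. A simple average like this can only help when the two detectors make different mistakes, which is the complementarity the abstract reports.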
