RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models

Xiangkun Hu; Dongyu Ru; Lin Qiu; Qipeng Guo; Tianhang Zhang; Yang Xu; Yun Luo; Pengfei Liu; Yue Zhang; Zheng Zhang

TL;DR

This work tackles hallucinations in large language models by introducing claim-triplets as fine-grained checking units and the RefChecker framework, which comprises a claim extractor and a verifier that compares triplets against references. It builds a benchmark across Zero, Noisy, and Accurate Context settings, annotated with 11k triplets from 2.1k responses across seven LLMs, and demonstrates better detection performance than previous claim granularities and methods. The study also provides a practical, open-source pipeline that supports both proprietary and open-source extractors and checkers, and whose results correlate strongly with human judgments. Overall, RefChecker advances reliable hallucination detection and offers actionable guidance for deploying fine-grained verification in real-world NLP tasks.

Abstract

Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection, compared to other granularities such as response, sentence and sub-sentence level claims. RefChecker outperforms prior methods by 6.8 to 26.1 points on our benchmark and the checking results of RefChecker are strongly aligned with human judgments. This work is open sourced at https://github.com/amazon-science/RefChecker
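The extractor-then-checker flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the RefChecker API: the names `ClaimTriplet`, `Label`, `check_response`, `toy_extractor`, and `toy_checker` are hypothetical, the (subject, predicate, object) layout is an assumed reading of "triplet", and the string-matching checker is a crude stand-in for the LLM-based or NLI-based components a real system would use. The three labels follow the Entailment, Contradiction, and Neutral categories named in the paper's figure captions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Label(Enum):
    ENTAILMENT = "Entailment"        # claim supported by the reference
    CONTRADICTION = "Contradiction"  # claim refuted by the reference
    NEUTRAL = "Neutral"              # claim not verifiable from the reference


@dataclass(frozen=True)
class ClaimTriplet:
    # Assumed layout of a claim-triplet (hypothetical field names).
    subject: str
    predicate: str
    obj: str


Extractor = Callable[[str], List[ClaimTriplet]]
Checker = Callable[[ClaimTriplet, str], Label]


def check_response(
    response: str, reference: str, extractor: Extractor, checker: Checker
) -> List[Tuple[ClaimTriplet, Label]]:
    """Extract claim-triplets from a response and label each against the reference."""
    return [(t, checker(t, reference)) for t in extractor(response)]


def toy_extractor(response: str) -> List[ClaimTriplet]:
    # Stand-in: a real extractor would generate triplets from the response text.
    # Here the triplets are hard-coded for the example response below.
    return [
        ClaimTriplet("Marie Curie", "born in", "Warsaw"),
        ClaimTriplet("Marie Curie", "born in", "1983"),
        ClaimTriplet("Marie Curie", "won", "Nobel Prize"),
    ]


def toy_checker(triplet: ClaimTriplet, reference: str) -> Label:
    # Stand-in heuristic based on substring matching only (an assumption,
    # far weaker than a model-based checker).
    ref = reference.lower()
    has_subject = triplet.subject.lower() in ref
    if has_subject and triplet.obj.lower() in ref:
        return Label.ENTAILMENT
    if has_subject and triplet.predicate.lower() in ref:
        # Reference covers the same subject and relation with a different object.
        return Label.CONTRADICTION
    return Label.NEUTRAL


if __name__ == "__main__":
    response = "Marie Curie was born in Warsaw in 1983 and won the Nobel Prize."
    reference = "Marie Curie was born in Warsaw in 1867."
    for triplet, label in check_response(response, reference, toy_extractor, toy_checker):
        print(triplet, label.value)
```

Passing the extractor and checker in as plain callables mirrors the abstract's point that either component can be swapped between proprietary and open-source models without changing the surrounding pipeline.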

Paper Structure (45 sections, 1 equation, 22 figures, 16 tables)

Introduction
Related Work
Hallucinations in LLMs
Granularity of Claims
Hallucination Checking
Hallucination Detection Benchmarks
RefChecker: Definition and Benchmark
Context Settings and Benchmarks
Zero Context (ZC)
Noisy Context (NC)
Accurate Context (AC)
Claim-Triplets and Definition of Hallucination
Human Evaluation
RefChecker Framework
Extractor
...and 30 more sections

Figures (22)

Figure 1: An example response split into sentences, sub-sentences (Min et al., 2023), and triplets, with the hallucination "1983". Triplets define the boundaries of claims more clearly, are fine-grained, and cover non-overlapping facts (unlike sub-sentences).
Figure 2: The RefChecker framework comprises two main components: an extractor denoted as $E$ and a checker denoted as $C$. Given a text to be checked, typically a response generated by an LLM, the extractor takes it as input and generates a set of knowledge triplets, referred to as claim-triplets. Subsequently, the checker assesses each claim-triplet by comparing it against a reference, assigning a hallucination label based on the evaluation.
Figure 3: Illustration of three settings of context, tasks and references. Zero Context is about seeking factual knowledge from the internal memory of the LLMs. Noisy Context has context information retrieved from a knowledge source, which is a RAG use case. Accurate Context has context provided in the input prompt. For Noisy and Accurate Context, we take the input context as the reference.
Figure 4: Definition of fine-grained hallucinations in an LLM-generated response compared with references. The intersections of the response and the references are the claims either supported (Entailment) or refuted (Contradiction) by the references. The remaining parts in the response are claims not verifiable by the references (Neutral). The other parts of the references are the content not mentioned in the response.
Figure 5: Performance statistics of 7 checkers under different claim granularities on 2.1k manually annotated responses. The detailed checker performance can be found in Table \ref{tab:granularity} of Appendix \ref{appendix:refchecker_checker}.
...and 17 more figures
