HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection

Xuefeng Du, Chaowei Xiao, Yixuan Li

TL;DR

HaloScope is a learning framework that uses unlabeled LLM generations collected in the wild for hallucination detection. It introduces an automated membership estimation score that separates truthful from untruthful generations within the unlabeled mixture data, which makes it possible to train a binary truthfulness classifier on top.

Abstract

The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Detecting hallucinations has therefore become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of a large amount of labeled truthful and hallucinated data. To address this challenge, we introduce HaloScope, a novel learning framework that leverages unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within unlabeled mixture data, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework requires neither extra data collection nor human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope achieves superior hallucination detection performance, outperforming competitive rivals by a significant margin. Code is available at https://github.com/deeplearningwisc/haloscope.
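
The two-stage idea in the abstract (score the unlabeled samples, then train a classifier on the resulting pseudo-labels) can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the score is taken to be the squared norm of each centered embedding projected onto the top-$k$ singular vectors, a larger score is assumed to indicate a hallucinated sample, and the names `membership_scores`, `pseudo_label`, and `train_logistic_classifier` are hypothetical. The released code at the URL above is the reference for the actual method.

```python
import numpy as np


def membership_scores(embeddings, k=1):
    """Squared projection norm onto the top-k singular vectors (assumed score)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are right singular vectors, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:k].T  # shape: (n_samples, k)
    return (projections ** 2).sum(axis=1)


def pseudo_label(scores, threshold):
    """Assumption: scores above the threshold are treated as hallucinated (1)."""
    return (scores > threshold).astype(int)


def train_logistic_classifier(x, y, lr=0.1, steps=500):
    """Binary logistic regression fitted with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y  # gradient of the log loss with respect to the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(x, w, b):
    """Hard 0/1 predictions from the fitted linear model."""
    return (x @ w + b > 0).astype(int)


# Synthetic stand-in for LLM embeddings: 80 "truthful" points near the origin
# and 20 "hallucinated" points shifted along one axis (purely illustrative).
rng = np.random.default_rng(0)
truthful = rng.normal(0.0, 0.1, size=(80, 16))
hallucinated = rng.normal(0.0, 0.1, size=(20, 16))
hallucinated[:, 0] += 3.0
unlabeled = np.vstack([truthful, hallucinated])

scores = membership_scores(unlabeled, k=1)
labels = pseudo_label(scores, threshold=2.0)  # threshold chosen by hand here
w, b = train_logistic_classifier(unlabeled, labels)
print("pseudo-labeled as hallucinated:", int(labels.sum()))
```

In this sketch the threshold is picked by hand for the synthetic data; in practice it would have to be selected some other way, for example on a small validation set.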

Paper Structure

This paper contains 36 sections, 10 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: Illustration of our proposed framework HaloScope for hallucination detection, leveraging unlabeled LLM generations in the wild. HaloScope first identifies the latent subspace to estimate the membership (truthful vs. hallucinated) for samples in unlabeled data $\mathcal{M}$ and then learns a binary truthfulness classifier.
  • Figure 2: Visualization of the representations for truthful (in orange) and hallucinated samples (in purple), and their projection onto the top singular vector $\mathbf{v}_1$ (in gray dashed line).
  • Figure 3: (a) Generalization across four datasets, where "(s)" denotes the source dataset and "(t)" denotes the target dataset. (b) Effect of the number of subspace components $k$ (Section \ref{sec:step_1}). (c) Impact of different layers. All numbers are AUROC based on LLaMA-2-7b-chat. Ablations in (b) and (c) are based on TruthfulQA.
  • Figure 4: Comparison with using direction projection for hallucination detection. Values are AUROC.
  • Figure 5: Comparison with ideal performance when training on labeled data.
  • ...and 2 more figures
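
The figure captions above report AUROC. As a reminder of what that number measures, the following minimal sketch computes it as the probability that a randomly chosen positive sample receives a higher detector score than a randomly chosen negative one, with ties counted as one half. The function name `auroc` and the toy inputs are hypothetical, and which class counts as positive is a convention this sketch does not take from the paper.

```python
import numpy as np


def auroc(scores, labels):
    """Pairwise-ranking AUROC: P(score_pos > score_neg), ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy example: two negatives (label 0) and two positives (label 1).
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The pairwise comparison is quadratic in the number of samples, which is fine for a sanity check; a sort-based implementation is preferable for large evaluation sets.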

Theorems & Definitions (4)

  • Definition 2.1: LLM generation
  • Definition 2.2: Hallucination detection
  • Definition 3.1: Unlabeled data distribution
  • Definition 3.2: Empirical dataset