Cost-Effective Hallucination Detection for LLMs

Simon Valentin; Jinmiao Fu; Gianluca Detommaso; Shaoyuan Xu; Giovanni Zappella; Bryan Wang

TL;DR

The paper tackles post-hoc hallucination detection in LLMs by proposing a model-agnostic pipeline that produces a calibrated hallucination probability for each response and detects hallucinations by thresholding that probability. It evaluates a wide range of single- and multi-generation scoring methods across multiple datasets and LLMs, showing that no single score is universally best and that calibrated multi-score ensembles improve detection. It introduces cost-effective multi-scoring to balance detection performance against a computational budget, often matching or surpassing more expensive methods. The results highlight the importance of calibration and of combining diverse signals, and suggest the approach is practical to deploy in real-world LLM applications.

Abstract

Large language models (LLMs) can be prone to hallucinations: unreliable outputs that are unfaithful to their inputs, inconsistent with external facts, or internally inconsistent. In this work, we address several challenges for post-hoc hallucination detection in production settings. Our pipeline for hallucination detection entails: first, producing a confidence score representing the likelihood that a generated answer is a hallucination; second, calibrating the score conditional on attributes of the inputs and candidate response; finally, performing detection by thresholding the calibrated score. We benchmark a variety of state-of-the-art scoring methods on different datasets, encompassing question answering, fact checking, and summarization tasks. We employ diverse LLMs to ensure a comprehensive assessment of performance. We show that calibrating individual scoring methods is critical for ensuring risk-aware downstream decision making. Based on findings that no individual score performs best in all situations, we propose a multi-scoring framework, which combines different scores and achieves top performance across all datasets. We further introduce cost-effective multi-scoring, which can match or even outperform more expensive detection methods, while significantly reducing computational overhead.
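
The three pipeline steps in the abstract (score, calibrate, threshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes raw scores already lie in [0, 1], uses plain histogram binning as a stand-in calibrator (the paper's calibration additionally conditions on attributes of the input and response, which is omitted here), and the scores and labels are made-up toy values. The names fit_histogram_calibrator and detect are hypothetical.

```python
from bisect import bisect_right


def fit_histogram_calibrator(scores, labels, n_bins=4):
    """Fit a histogram-binning calibrator (an assumed stand-in method).

    Each raw score in [0, 1] is mapped to the empirical hallucination
    rate of the bin it falls into on the held-out calibration data.
    """
    edges = [i / n_bins for i in range(1, n_bins)]  # interior bin edges
    totals = [0] * n_bins
    positives = [0] * n_bins
    for s, y in zip(scores, labels):
        b = bisect_right(edges, s)
        totals[b] += 1
        positives[b] += y
    base_rate = sum(labels) / len(labels)
    # Empty bins fall back to the overall hallucination rate.
    rates = [p / t if t else base_rate for p, t in zip(positives, totals)]

    def calibrate(score):
        return rates[bisect_right(edges, score)]

    return calibrate


def detect(score, calibrate, threshold=0.5):
    """Flag a response when its calibrated probability reaches the threshold."""
    return calibrate(score) >= threshold


# Toy calibration set (made-up values): raw scores and 1 = hallucination.
raw_scores = [0.05, 0.15, 0.30, 0.35, 0.55, 0.60, 0.85, 0.90]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

calibrate = fit_histogram_calibrator(raw_scores, labels, n_bins=4)
print(calibrate(0.1), calibrate(0.4), calibrate(0.95))
print(detect(0.4, calibrate), detect(0.1, calibrate))
```

Because the threshold is applied to a calibrated probability rather than a raw score, it can be read as a risk level, which is what makes the downstream decision risk-aware.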

Paper Structure (36 sections, 4 equations, 5 figures, 4 tables)

Introduction
Detecting LLM Hallucinations
Formalizing Hallucination Detection
Scoring Methods
Single-generation
Inverse Perplexity
P(True)
NLI Text Classification
Verbalized Probabilities
Multi-generation
SelfCheckGPT
Similarity Degree
NeMO Guardrails: Hallucination Rail
Calibration
Multi-Scoring: Combining Scores
...and 21 more sections

Figures (5)

Figure 1: Schematic overview of our proposed hallucination detection approach.
Figure 2: Hallucination detection F1 versus computational budget $B$ for cost-effective multi-scoring.
Figure 3: Relationship between number of generations used for SelfCheckGPT and performance of cost-effective multi-score vs SelfCheckGPT alone on TriviaQA.
Figure 4: Heatmap of Spearman rank correlations between scores on TriviaQA.
Figure 5: Example of multi-generation failure-case in NLP systems, illustrating conflicting responses.
