Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Ernesto Quevedo, Jorge Yero, Rachel Koerner, Pablo Rivas, Tomas Cerny

TL;DR

This paper addresses hallucinations in large language models (LLMs) by proposing a lightweight, supervised detection framework that relies on four numerical features derived from token and vocabulary probabilities. The probabilities come from an LLM evaluator (LLM_E), which can differ from the LLM generator (LLM_G). Two classifiers, a Logistic Regression and a Simple Neural Network, are trained on forced-decoding transcripts of pairs of condition text and generated text, and are evaluated on three benchmarks (HaluEval, HELM, True-False) with multiple evaluators to study cross-model robustness. The key contributions are (i) demonstrating that a four-feature scheme can achieve competitive or superior performance on several tasks, (ii) showing that different evaluators can yield robust indicators of hallucination, and (iii) providing a detailed ablation analysis that identifies which features are most informative for each task. The practical impact is a resource-efficient, cross-model hallucination detector that can be applied to real-time LLM applications; the publicly released code enables replication and extension across domains. The work also reports limitations on certain datasets (notably True-False) and suggests avenues for improvement, including hybrid approaches and hidden-layer signals for broader coverage.

Abstract

Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated content. Current methods often demand substantial resources and rely on extensive LLMs, or employ supervised learning with multidimensional features or intricate linguistic and semantic analyses that are difficult to reproduce, and they largely depend on using the same LLM that hallucinated. This paper introduces a supervised learning approach employing two simple classifiers that use only four numerical features derived from token and vocabulary probabilities obtained from LLM evaluators, which are not necessarily the same model that generated the text. The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks. Additionally, we provide a comprehensive examination of the strengths and weaknesses of our approach, highlighting the significance of the features utilized and the LLM employed as an evaluator. We have released our code publicly at https://github.com/Baylor-AI/HalluDetect.
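
To make the idea of "four numerical features derived from token and vocabulary probabilities" concrete, the following is a minimal sketch, not the authors' implementation. This summary does not name the four features, so the ones below (minimum token probability, average token probability, maximum deviation from the evaluator's top choice, minimum spread of the vocabulary distribution) are illustrative assumptions, as are the names `TokenProbFeatures` and `extract_features`. The inputs are assumed to come from forced decoding: for each generated token, the probability the evaluator assigns to that token and the evaluator's full distribution over the vocabulary at that position.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TokenProbFeatures:
    """Four summary statistics per generated text (illustrative assumption)."""
    min_token_prob: float      # lowest probability given to any generated token
    avg_token_prob: float      # mean probability over generated tokens
    max_prob_deviation: float  # largest gap between top vocab prob and the token's prob
    min_prob_spread: float     # smallest (max - min) over the vocabulary distribution

    def as_vector(self) -> List[float]:
        # Fixed-length input for any downstream binary classifier.
        return [self.min_token_prob, self.avg_token_prob,
                self.max_prob_deviation, self.min_prob_spread]


def extract_features(token_probs: Sequence[float],
                     vocab_probs: Sequence[Sequence[float]]) -> TokenProbFeatures:
    """Summarize forced-decoding probabilities from a hypothetical evaluator.

    token_probs[i]  : evaluator probability of the i-th generated token
    vocab_probs[i]  : evaluator distribution over the vocabulary at position i
    """
    if not token_probs or len(token_probs) != len(vocab_probs):
        raise ValueError("need one non-empty vocabulary distribution per token")
    deviations = [max(dist) - p for p, dist in zip(token_probs, vocab_probs)]
    spreads = [max(dist) - min(dist) for dist in vocab_probs]
    return TokenProbFeatures(
        min_token_prob=min(token_probs),
        avg_token_prob=sum(token_probs) / len(token_probs),
        max_prob_deviation=max(deviations),
        min_prob_spread=min(spreads),
    )


# Toy example: three generated tokens, vocabulary of size four (made-up numbers).
toy_vocab_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
toy_token_probs = [0.70, 0.20, 0.25]
features = extract_features(toy_token_probs, toy_vocab_probs)
print(features.as_vector())
```

The resulting four-element vector is the kind of input a logistic regression or small neural network could be trained on with hallucination labels. Because the features depend only on probabilities, the evaluator supplying them does not have to be the model that generated the text.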

Paper Structure (27 sections, 1 equation, 1 figure, 8 tables)