Can LLMs Detect Their Own Hallucinations?

Sora Kadotani; Kosuke Nishida; Kyosuke Nishida

TL;DR

This work investigates whether large language models can detect their own hallucinations by framing detection as a sentence classification task. It introduces a three-stage framework that generates true and false sentences from relational triples and uses a Chain-of-Thought (CoT) classifier to judge whether each sentence is true. With CoT, GPT-3.5 Turbo achieves 58.2% recall on its own hallucinations, up from 21.9% without CoT. The results show a positive relationship between the knowledge stored in the model's parameters and detection performance, underscoring the importance of pretraining data. The framework offers a practical, single-LLM approach to self-verification and a foundation for future work on domain-specific hallucination analysis and reduction.

Abstract

Large language models (LLMs) can generate fluent responses, but sometimes hallucinate facts. In this paper, we investigate whether LLMs can detect their own hallucinations. We formulate hallucination detection as a sentence classification task. We propose a framework for estimating LLMs' hallucination detection capability and a classification method that uses Chain-of-Thought (CoT) to extract knowledge from their parameters. The experimental results indicate that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations. We conclude that LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.
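
To make the headline number concrete, the sketch below shows one way the detection rate could be computed once detection is treated as sentence classification: the share of false sentences that a classifier labels as false. It is a minimal illustration, not the authors' implementation. The function name `hallucination_recall`, the `classify` callable (a stand-in for an LLM-based classifier, with or without CoT), and the toy data are all assumptions introduced here.

```python
from typing import Callable, Iterable, Tuple


def hallucination_recall(
    sentences: Iterable[Tuple[str, bool]],
    classify: Callable[[str], bool],
) -> float:
    """Return the fraction of false sentences that the classifier flags as false.

    sentences: (text, is_true) pairs, where is_true is the gold label.
    classify:  hypothetical stand-in for an LLM-based classifier; it returns
               True when it judges the sentence to be true.
    """
    false_total = 0  # number of gold-false sentences (hallucinations)
    detected = 0     # how many of those the classifier labeled as false
    for text, is_true in sentences:
        if is_true:
            continue  # recall only considers the false sentences
        false_total += 1
        if not classify(text):
            detected += 1
    if false_total == 0:
        raise ValueError("no false sentences to evaluate")
    return detected / false_total


# Toy data (hypothetical): gold labels and a lookup table of classifier verdicts.
toy_sentences = [
    ("sentence A", True),
    ("sentence B", False),
    ("sentence C", False),
]
toy_verdicts = {"sentence A": True, "sentence B": False, "sentence C": True}

toy_recall = hallucination_recall(toy_sentences, lambda s: toy_verdicts[s])
```

In this toy setup the lookup table plays the role of the model's judgments. Swapping in a real classifier only requires passing a different `classify` callable.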
