Detecting Hallucinations in Authentic LLM-Human Interactions

Yujie Ren, Niklas Gruhlke, Anne Lauscher

TL;DR

This work introduces AuthenHallu, the first hallucination-detection benchmark constructed entirely from authentic LLM-human interactions, addressing the realism gap in prior benchmarks. It provides a corpus of 400 authentic dialogues (800 query-response pairs) annotated for hallucination occurrence and categorized into input-, context-, and fact-conflicting types. Hallucinations occur in 31.4% of the pairs, and the rate rises to 60.0% in Math & Number Problems. The authors evaluate six vanilla LLMs in zero-shot settings on detection and categorization tasks, plus ensemble and in-context variants. They find that vanilla detectors show limited reliability (best detection F1 around 64%) and that ensembles do not consistently surpass the best single model. Categorization is particularly challenging, with substantial cross-model variability, although fact-conflicting hallucinations are relatively easier to detect. Overall, AuthenHallu provides a realistic, topic-sensitive benchmark that highlights the current limitations of vanilla LLMs for hallucination detection and categorization in authentic usage, informing future improvements in detection strategies and evaluation paradigms.
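
To make the detection metrics above concrete, here is a minimal sketch of how a binary hallucination detector and an ensemble of detectors could be scored with precision, recall, and F1. This summary does not say how the paper combines models, so the majority vote used here is an assumption, and the detector names and labels are hypothetical toy data, not results from AuthenHallu.

```python
def majority_vote(votes):
    """Return True if strictly more than half of the detectors flag a hallucination.

    Majority voting is an assumed combination rule, chosen for illustration.
    """
    return sum(votes) * 2 > len(votes)


def precision_recall_f1(gold, pred):
    """Compute precision, recall, and F1 for the positive (hallucinated) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical human annotations: True means the response is hallucinated.
gold = [True, False, True, True, False, False]

# Hypothetical per-pair verdicts from three detectors (names are placeholders).
detector_outputs = {
    "detector_a": [True, False, False, True, True, False],
    "detector_b": [True, True, False, True, False, False],
    "detector_c": [False, False, True, True, False, False],
}

# Combine the verdicts pair by pair, then score each detector and the ensemble.
ensemble = [majority_vote(votes) for votes in zip(*detector_outputs.values())]

scores = {name: precision_recall_f1(gold, pred) for name, pred in detector_outputs.items()}
scores["ensemble"] = precision_recall_f1(gold, ensemble)

for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data the ensemble happens to score higher than each single detector. That is a property of the made-up labels only and says nothing about the benchmark, where ensembles are reported not to consistently beat the best single model.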

Abstract

As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed--either through deliberate hallucination induction or simulated interactions--rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.

Paper Structure

This paper contains 54 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: AuthenHallu benchmark construction procedure. In stage 1, we select representative dialogues through filtering and clustering, while in stage 2, we conduct human annotation to assess hallucination occurrence and category.
  • Figure 2: Hallucination rate across different topics. The figure breaks down hallucinations into three types: fact-conflicting, input-conflicting, and context-conflicting hallucinations. Tasks involving numerical reasoning or temporal understanding demonstrate the highest rates.
  • Figure 3: Silhouette score and inertia as functions of the number of clusters.
  • Figure 4: Topic naming instruction for GPT-4o.
  • Figure 5: Prompt for single-model detection.
  • ...and 2 more figures