HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

Zhiying Zhu, Yiming Yang, Zhiqing Sun

TL;DR

Problem: Hallucinations undermine the reliability of language models in real-world settings, and existing benchmarks fail to capture in-the-wild, user-driven interactions.

Approach: HaluEval-Wild builds a real-world benchmark of 500 adversarially filtered ShareGPT queries, categorized into five types (OoS, CR, IC, BM, and CE). It uses retrieval-augmented generation (RAG) to create reference answers and GPT-4 as a judge for hallucination detection, and it investigates RAG and self-reflection as mitigation strategies.

Contributions: The first in-the-wild hallucination benchmark; an analysis across diverse models; a demonstration that distillation can increase hallucinations; and evidence that RAG and self-reflection can reduce hallucinations.

Significance: Provides a practical, scalable tool for evaluating and improving the factual integrity and reliability of language systems in realistic user interactions, with open-source resources for the community.

Abstract

Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered by Alpaca) from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the types of hallucinations LLMs exhibit, and we synthesize the reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach to understanding and improving LLM reliability in scenarios reflective of real-world interactions. Our benchmark is available at https://github.com/HaluEval-Wild/HaluEval-Wild.
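
To make the notion of a per-type hallucination rate concrete, the following is a minimal sketch, not the authors' released code. It assumes that a judge (GPT-4 in the paper) has already compared each model response against its reference answer and produced a boolean verdict. The `Verdict` type and the `hallucination_rates` function are hypothetical names introduced here for illustration; only the five category abbreviations come from the text above.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple

# The five query-type abbreviations named in the TL;DR.
CATEGORIES = ("OoS", "CR", "IC", "BM", "CE")


class Verdict(NamedTuple):
    """Hypothetical record: one judged response for one benchmark query."""
    category: str       # one of CATEGORIES
    hallucinated: bool  # assumed output of the judge for this response


def hallucination_rates(verdicts: Iterable[Verdict]) -> Dict[str, float]:
    """Return the fraction of hallucinated responses per category and overall."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for v in verdicts:
        if v.category not in CATEGORIES:
            raise ValueError(f"unknown category: {v.category!r}")
        totals[v.category] += 1
        flagged[v.category] += int(v.hallucinated)

    # Categories with no judged responses are omitted rather than reported as 0.
    rates = {c: flagged[c] / totals[c] for c in CATEGORIES if totals[c]}
    n = sum(totals.values())
    if n:
        rates["overall"] = sum(flagged.values()) / n
    return rates


# Toy verdicts for illustration only; these are not results from the paper.
example = [
    Verdict("OoS", True),
    Verdict("OoS", False),
    Verdict("CR", True),
    Verdict("IC", False),
]
example_rates = hallucination_rates(example)
```

Lower rates are better. Reporting the rate per category alongside the overall figure is what allows the fine-grained comparison across query types that the abstract describes.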

Paper Structure

This paper contains 29 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The construction pipeline of HaluEval-Wild.
  • Figure 2: Hallucination rates ($\downarrow$) of direct generation, SR (self-reflection), and hinted SR (hinted self-reflection).
  • Figure 3: The distribution of query types across filtered challenging conversations.