
DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation

A B M Ashikur Rahman, Saeed Anwar, Muhammad Usman, Ajmal Mian

TL;DR

DefAn tackles the challenge of evaluating LLM hallucinations by delivering a large-scale, domain-diverse benchmark of over 75,000 prompts designed to elicit definitive answers. It targets three core aspects—factual accuracy, faithfulness to the prompt, and consistency across paraphrased prompts—and introduces a public/hidden data split to ensure robust benchmarking. The framework includes eight knowledge domains and a detailed paraphrasing pipeline, with automated evaluation using three metrics: $FCH$, $PMH$, and $RC$. Empirical results across six LLMs show substantial hallucinations, especially on numeric tasks, while Gemini Pro often achieves the best overall performance. The dataset offers a practical, scalable tool for measuring and mitigating hallucinations in real-world generative AI deployments.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. However, they are prone to hallucinations: generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are small and rely on multiple-choice questions, which are inadequate for evaluating the generative prowess of LLMs. To measure hallucination in LLMs, this paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance and a hidden segment for benchmarking various LLMs. In our experiments, we tested six LLMs (GPT-3.5, Llama 2, Llama 3, Gemini, Mixtral, and Zephyr), revealing that overall factual hallucination ranges from 59% to 82% on the public dataset and 57% to 76% on the hidden benchmark. Prompt misalignment hallucination ranges from 6% to 95% on the public dataset and 17% to 94% on the hidden counterpart. Average consistency ranges from 21% to 61% and 22% to 63%, respectively. Domain-wise analysis shows that LLM performance deteriorates significantly when asked for specific numeric information, while it is moderate for person, location, and date queries. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for LLM performance evaluation. Our dataset and LLM responses are available at https://github.com/ashikiut/DefAn.
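To make the three reported quantities concrete, the following is a minimal sketch of how fact-contradicting hallucination (FCH), prompt misalignment hallucination (PMH), and response consistency (RC) could be computed from a set of model responses. It is not the authors' evaluation code. The `Response` record, its field names, the exact-match comparison after normalization, and the majority-agreement definition of consistency are all assumptions made for illustration; the paper's own definitions may differ.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Response:
    """One model response (hypothetical record layout, not from the paper)."""
    question_id: str  # shared by all paraphrases of the same question
    answer: str       # answer extracted from the model output
    gold: str         # reference (definitive) answer
    on_prompt: bool   # True if the output followed the prompt's instructions


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def hallucination_rates(responses: list[Response]) -> dict[str, float]:
    """Return assumed FCH, PMH, and RC scores as fractions in [0, 1]."""
    if not responses:
        raise ValueError("responses must not be empty")
    n = len(responses)

    # Assumption: FCH is the share of answers that do not match the reference.
    fch = sum(normalize(r.answer) != normalize(r.gold) for r in responses) / n

    # Assumption: PMH is the share of outputs that ignore the prompt's instructions.
    pmh = sum(not r.on_prompt for r in responses) / n

    # Assumption: RC is, per question, the share of paraphrase answers that
    # agree with the most frequent answer, averaged over all questions.
    groups: dict[str, list[str]] = defaultdict(list)
    for r in responses:
        groups[r.question_id].append(normalize(r.answer))
    rc = sum(
        Counter(answers).most_common(1)[0][1] / len(answers)
        for answers in groups.values()
    ) / len(groups)

    return {"FCH": fch, "PMH": pmh, "RC": rc}


# Toy data: three paraphrases of one question and a single numeric question.
sample = [
    Response("q1", "Paris", "Paris", True),
    Response("q1", "paris", "Paris", True),
    Response("q1", "Lyon", "Paris", False),
    Response("q2", "1912", "1911", True),
]
rates = hallucination_rates(sample)
```

Under these assumptions, lower FCH and PMH and higher RC indicate better behavior, which matches how the ranges in the abstract are read. Exact string matching is a deliberately simple stand-in; numeric or date answers would likely need a more tolerant comparison.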

Paper Structure (17 sections, 4 figures, 9 tables)

This paper contains 17 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Comparison between different types of hallucinations. a) Fact Contradicting Hallucinations and b) Prompt Misalignment Hallucinations. Best viewed on a zoomed-in screen.
  • Figure 2: Distribution of prompts by domain
  • Figure 3: The performance comparison on all three evaluation metrics for LLMs in a) public and b) hidden datasets.
  • Figure 4: Sample prompt execution, visualized using the OpenAI Playground.