FACTOID: FACtual enTailment fOr hallucInation Detection

Vipula Rawte, S. M Towhidul Islam Tonmoy, Krishnav Rajbangshi, Shravani Nag, Aman Chadha, Amit P. Sheth, Amitava Das

TL;DR

This work tackles factual hallucinations in generated text by introducing Factual Entailment (FE), a task that combines entailment with span-level verification to pinpoint where a claim contradicts reality. It introduces FACTOID, an FE benchmark that extends HiLT with 2M training pairs and 89,610 category-specific hallucination samples, plus annotations for span-level refutations. The authors propose a multi-task learning framework that uses long-text embeddings and span-based textual entailment (with SpanBERT and RoFormer) to jointly perform entailment, span detection, and hallucination classification, achieving substantial gains over traditional TE. They also introduce HVI_auto, an automated hallucination vulnerability index that ranks 15 modern LLMs by their propensity to hallucinate across the BN, TI, IF, and P categories, providing a practical tool for evaluation and model selection.

Abstract

The widespread adoption of Large Language Models (LLMs) has brought numerous benefits. However, hallucination is a significant concern. In response, Retrieval Augmented Generation (RAG) has emerged as a highly promising paradigm to improve LLM outputs by grounding them in factual information. RAG relies on textual entailment (TE) or similar methods to check whether the text produced by LLMs is supported or contradicted by retrieved documents. This paper argues that conventional TE methods are inadequate for spotting hallucinations in content generated by LLMs. For instance, consider a prompt about the "USA's stance on the Ukraine war". The AI-generated text states, "...U.S. President Barack Obama says the U.S. will not put troops in Ukraine..." However, the U.S. president during the war is Joe Biden, so the generated text contradicts factual reality. Moreover, current TE systems are unable to accurately annotate the given text and identify the exact portion that is contradicted. To address this, we introduce a new type of TE called "Factual Entailment (FE)", which aims to detect factual inaccuracies in content generated by LLMs while also highlighting the specific text segment that contradicts reality. We present FACTOID (FACTual enTAILment for hallucInation Detection), a benchmark dataset for FE. We propose a multi-task learning (MTL) framework for FE, incorporating state-of-the-art (SoTA) long-text embeddings such as e5-mistral-7b-instruct, along with GPT-3, SpanBERT, and RoFormer. The proposed MTL architecture for FE achieves an average 40% improvement in accuracy on the FACTOID benchmark compared to SoTA TE methods. As FE automatically detects hallucinations, we assessed 15 modern LLMs and ranked them using our proposed Auto Hallucination Vulnerability Index (HVI_auto). This index provides a comparative scale for evaluating and ranking LLMs according to their hallucinations.

Paper Structure

This paper contains 39 sections, 1 equation, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Utilizing longer embeddings for extended sentences is advantageous. The cosine similarities are more prominent in Jina embeddings (Günther et al., 2023) compared to MiniLLM (Gu et al., 2023). Consequently, the cosine similarity for the pair (sent1, sent2) increases from 0.76 to 0.93, as indicated by the green dashed line.
  • Figure 2: A summary of the overall multi-task learning framework for Factual Entailment. The framework encompasses three tasks: i) entailment, ii) span detection, and iii) hallucination classification.
  • Figure 3: Results showing how FE performs better than TE at detecting hallucination in six different categories.
  • Figure 4: HVI for different hallucination categories across various LLMs.
  • Figure 5: The HVI scale illustrates the hallucination tendencies exhibited by various LLMs.
  • ...and 2 more figures
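
The comparison in Figure 1 rests on cosine similarity between sentence embeddings. As a minimal sketch of that measure only, the snippet below computes it for two short vectors. The vectors `sent1` and `sent2` are hypothetical toy values, not actual Jina or MiniLLM embeddings, and the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings standing in for two sentence vectors;
# real embedding models produce vectors with hundreds or thousands of dimensions.
sent1 = [0.20, 0.70, 0.10]
sent2 = [0.25, 0.60, 0.20]

print(round(cosine_similarity(sent1, sent2), 3))
```

In practice the two vectors would come from the same embedding model applied to each sentence, and a higher score indicates that the model places the two sentences closer together in its embedding space.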