Span-Level Hallucination Detection for LLM-Generated Answers

Passant Elchafei, Mervet Abu-Elkheir

TL;DR

The paper tackles span-level hallucination detection in LLM-generated answers for English and Arabic by decomposing responses into semantic units via SRL, grounding them with retrieved context, and assessing factuality with a DeBERTa entailment model plus token-level confidence. A score integration step combines entailment and confidence into a refined score, enabling precise identification of hallucinated spans, with a threshold used to flag them. Evaluations on the Mu-SHROOM dataset show competitive English performance (IoU ≈ 0.358, Cor ≈ 0.322) and varied Arabic results, with dependency parsing before SRL improving robustness. LLM-based fact-checking (GPT-4 and LLaMA) provides additional verification, highlighting language-specific challenges and the value of multilingual, linguistically informed detection for improving factual consistency in real-world applications.
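The IoU figure quoted above can be read as an overlap between predicted and gold hallucinated characters. The sketch below is a minimal illustration under that assumption, not the official Mu-SHROOM scorer: the function name `char_iou`, the half-open `(start, end)` offset convention, and the choice to return 1.0 when both sides are empty are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # assumed convention: half-open character offsets (start, end)


def spans_to_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand (start, end) spans into the set of character indices they cover."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_iou(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Intersection-over-union of predicted and gold hallucinated characters."""
    pred_chars = spans_to_chars(predicted)
    gold_chars = spans_to_chars(gold)
    if not pred_chars and not gold_chars:
        # Assumption: agreeing that nothing is hallucinated counts as a perfect match.
        return 1.0
    return len(pred_chars & gold_chars) / len(pred_chars | gold_chars)
```

For instance, a prediction covering characters 0 to 9 against a gold span covering characters 5 to 14 shares 5 characters out of 15 in the union, giving an IoU of one third.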

Abstract

Detecting spans of hallucination in LLM-generated answers is crucial for improving factual consistency. This paper presents a span-level hallucination detection framework for the SemEval-2025 Shared Task, focusing on English and Arabic texts. Our approach integrates Semantic Role Labeling (SRL) to decompose the answer into atomic roles, which are then compared with a retrieved reference context obtained via question-based LLM prompting. Using a DeBERTa-based textual entailment model, we evaluate each role's semantic alignment with the retrieved context. The entailment scores are further refined through token-level confidence measures derived from output logits, and the combined scores are used to detect hallucinated spans. Experiments on the Mu-SHROOM dataset demonstrate competitive performance. Additionally, hallucinated spans have been verified through fact-checking by prompting GPT-4 and LLaMA. Our findings contribute to improving hallucination detection in LLM-generated responses.
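The score-integration step described in the abstract can be sketched roughly as follows. This page does not reproduce the paper's equations, so the combination rule here is an assumption: a weighted average of the entailment probability and a span confidence computed as the geometric mean of token probabilities. The names `RoleSpan`, `span_confidence`, `refined_score`, and `flag_hallucinations`, along with the `alpha` and `threshold` defaults, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RoleSpan:
    """One SRL role extracted from the answer (hypothetical container)."""
    text: str
    start: int                    # character offset where the role begins
    end: int                      # character offset where the role ends (exclusive)
    entailment: float             # entailment probability against the retrieved context
    token_logprobs: List[float]   # log-probabilities of the role's tokens, from output logits


def span_confidence(token_logprobs: List[float]) -> float:
    """Geometric mean of token probabilities (an assumed aggregation)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def refined_score(span: RoleSpan, alpha: float = 0.5) -> float:
    """Assumed integration rule: weighted average of entailment and confidence."""
    return alpha * span.entailment + (1.0 - alpha) * span_confidence(span.token_logprobs)


def flag_hallucinations(
    spans: List[RoleSpan], alpha: float = 0.5, threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return character offsets of spans whose refined score falls below the threshold."""
    return [(s.start, s.end) for s in spans if refined_score(s, alpha) < threshold]
```

In this sketch, a role with high entailment and confident tokens stays above the threshold, while a role the context does not support and the model generated with low confidence is flagged. The actual weighting and threshold used in the paper may differ.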

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures.

Figures (2)

  • Figure 1: Span-Level Hallucination Detection Framework
  • Figure 2: Example of Arabic sentence SRL extraction