keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection
Saketh Reddy Vemula, Parameswari Krishnamurthy
TL;DR
This work tackles the challenge of locating hallucination spans in multilingual LLM outputs by proposing a zero-resource, black-box approach that exploits uncertainty in stochastically sampled responses. The method segments text into spans, matches them to diverse samples, and computes a hallucination score $S_h(s_i)=\alpha H_s(s_i)+\beta H_l(s_i)+\gamma F(s_i)$ with $\alpha=0.4$, $\beta=0.4$, $\gamma=0.2$, where $H_s$ is the semantic entropy, $H_l$ the lexical entropy, and $F$ a frequency-based score. It then refines spans, merges overlaps, and outputs the spans whose score exceeds a threshold, enabling precise localization across 14 languages in the Mu-SHROOM task. Experimental results show competitive IoU and Cor across languages, with strong performance in Basque, Finnish, Italian, and Hindi, demonstrating cross-linguistic applicability, though some false positives remain due to sampling noise. The main limitation is the lack of supervised training; the authors suggest fine-tuning on labeled data and adding factual verification as future work to improve reliability in real-world AI systems.
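As a rough illustration of the scoring step, the weighted sum and thresholding might be sketched as follows. Only the three weights come from the summary above; the function names, the assumption that each component is already normalized to [0, 1], and the threshold value of 0.5 are hypothetical choices made for this example.

```python
# Weights reported in the summary: alpha, beta, gamma.
ALPHA, BETA, GAMMA = 0.4, 0.4, 0.2


def hallucination_score(h_sem, h_lex, freq):
    """Combine the three per-span signals into S_h.

    Assumption: each input is already scaled to [0, 1]; the summary does
    not state how the components are normalized.
    """
    return ALPHA * h_sem + BETA * h_lex + GAMMA * freq


def flag_spans(span_signals, threshold=0.5):
    """Return the spans whose score exceeds the threshold.

    span_signals maps a span string to (h_sem, h_lex, freq).
    The default threshold is a placeholder, not a value from the paper.
    """
    return [
        span
        for span, (h_sem, h_lex, freq) in span_signals.items()
        if hallucination_score(h_sem, h_lex, freq) > threshold
    ]


# Toy signals: one span that is stable across samples, one that is not.
signals = {
    "Paris": (0.1, 0.0, 0.1),
    "1887": (0.9, 0.8, 0.7),
}
flagged = flag_spans(signals)
```

With these toy inputs only the unstable span is flagged. Because the weights sum to 1, the score stays in [0, 1] whenever the inputs do, which keeps a single threshold meaningful across spans.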
Abstract
Identifying hallucination spans in text generated by black-box language models is essential for real-world applications. A recent effort in this direction is SemEval-2025 Task 3, Mu-SHROOM, a Multilingual Shared Task on Hallucinations and Related Observable Over-generation Errors. In this work, we present our solution to this problem, which exploits the variability of stochastically sampled responses to identify hallucinated spans. Our hypothesis is that if a language model is certain of a fact, its sampled responses will be consistent, whereas hallucinated facts will yield divergent and conflicting responses. We measure this divergence through entropy-based analysis, which allows accurate identification of hallucinated segments. Our method does not depend on additional training and is therefore cost-effective and adaptable. In addition, we conduct extensive hyperparameter tuning and perform error analysis, which gives us crucial insights into model behavior.
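The hypothesis can be made concrete with a small sketch: Shannon entropy over the values that different samples give for the same span is zero when all samples agree and grows as they diverge. This is a simplified stand-in that compares exact strings; the function name and the toy samples are assumptions for illustration, and the abstract does not specify the exact entropy formulation used.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values.

    values holds what each sampled response says for one span, compared
    here by exact string match (a simplifying assumption).
    """
    counts = Counter(values)
    total = len(values)
    # Sum of p * log2(1/p) over the distinct values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# All samples agree: the model looks certain about this span.
consistent = sample_entropy(["1889", "1889", "1889", "1889"])

# Every sample disagrees: a candidate hallucinated span.
divergent = sample_entropy(["1887", "1889", "1901", "1875"])
```

Here `consistent` evaluates to 0 bits and `divergent` to 2 bits, the maximum for four samples, so higher entropy marks the span as more likely hallucinated.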
