MSA at SemEval-2025 Task 3: High Quality Weak Labeling and LLM Ensemble Verification for Multilingual Hallucination Detection

Baraa Hikal, Ahmed Nasreldin, Ali Hamdi

TL;DR

MSA at SemEval-2025 Task 3 tackles multilingual hallucination detection in instruction-tuned LLM outputs by combining task-specific prompt engineering for weak label generation with an LLM ensemble verification mechanism. One model acts as a Span Extractor while three others adjudicate spans via probability-based voting, with iterative model rotation and Consensus-Based Labeling; post-processing with fuzzy matching refines span alignment. The approach achieves strong multilingual performance, ranking highly across languages (e.g., 1st in Arabic and Basque) and illustrating the value of ensemble verification and span refinement in reducing annotation biases. This work advances multilingual hallucination detection without relying on training data, and points to future improvements via external knowledge and tighter span localization to further enhance reliability in multilingual NLP systems.
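The adjudication step described above can be illustrated with a minimal sketch. The summary does not state how the three verifier probabilities are aggregated, so the mean rule, the 0.5 threshold, and the names `verify_spans` and `verifier_probs` are assumptions made purely for illustration, not the authors' implementation.

```python
from statistics import mean


def verify_spans(candidate_spans, verifier_probs, threshold=0.5):
    """Keep candidate spans whose mean verifier probability reaches the threshold.

    candidate_spans: spans proposed by the extractor model.
    verifier_probs:  maps each span to the probabilities assigned by the
                     verifier models (three in the setup described above).
    threshold:       assumed cut-off; the source does not give a value.
    """
    kept = []
    for span in candidate_spans:
        score = mean(verifier_probs[span])  # assumed aggregation rule
        if score >= threshold:
            kept.append((span, score))
    return kept


# Hypothetical probabilities from three verifiers for two candidate spans.
candidates = ["in 1875", "Paris"]
probs = {
    "in 1875": [0.9, 0.8, 0.4],
    "Paris": [0.2, 0.1, 0.6],
}
accepted = verify_spans(candidates, probs)
```

With these made-up numbers only the first span survives, since its mean probability is above the assumed threshold while the second is below it. A majority vote over thresholded probabilities would be an equally plausible reading of "probability-based voting".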

Abstract

This paper describes our submission for SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The task involves detecting hallucinated spans in text generated by instruction-tuned Large Language Models (LLMs) across multiple languages. Our approach combines task-specific prompt engineering with an LLM ensemble verification mechanism, where a primary model extracts hallucination spans and three independent LLMs adjudicate their validity through probability-based voting. This framework simulates the human annotation workflow used in the shared task validation and test data. Additionally, fuzzy matching refines span alignment. Our system ranked 1st in Arabic and Basque, 2nd in German, Swedish, and Finnish, and 3rd in Czech, Farsi, and French.
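As an illustration of how fuzzy matching can refine span alignment, the sketch below maps a span returned by a model (possibly with small character-level deviations) back to character offsets in the generated text. The abstract does not name a matching library or a similarity threshold, so the use of Python's standard `difflib`, the `min_ratio` value, the window slack of two characters, and the function name `align_span` are all assumptions.

```python
from difflib import SequenceMatcher


def align_span(text, span, min_ratio=0.8):
    """Return (start, end) offsets of the substring of `text` closest to `span`.

    Tries an exact match first, then slides windows of nearly the same length
    over `text` and keeps the most similar one. Returns None when no window
    reaches `min_ratio` (an assumed threshold).
    """
    exact = text.find(span)
    if exact != -1:
        return exact, exact + len(span)

    best_ratio, best_range = 0.0, None
    n = len(span)
    # Allow the matched window to be slightly shorter or longer than the span.
    for width in range(max(1, n - 2), n + 3):
        for start in range(0, len(text) - width + 1):
            window = text[start:start + width]
            ratio = SequenceMatcher(None, span, window, autojunk=False).ratio()
            if ratio > best_ratio:
                best_ratio, best_range = ratio, (start, start + width)
    return best_range if best_ratio >= min_ratio else None


# Hypothetical example: the extractor transposed two digits in the span.
text = "The Eiffel Tower was completed in 1875 in Paris."
offsets = align_span(text, "completed in 1785")
```

The brute-force window scan is quadratic in the text length, which is acceptable for short model outputs; a dedicated fuzzy-matching library would be a reasonable substitute for longer texts.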

Paper Structure

This paper contains 20 sections, 6 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of our hallucination detection pipeline.
  • Figure 2: Performance rankings of LLMs according to the Vectara Hallucination Leaderboard (Vectara, 2024).
  • Figure 3: Dataset examples in different languages. The hallucinated span(s) are highlighted.