
"AGI" team at SHROOM-CAP: Data-Centric Approach to Multilingual Hallucination Detection using XLM-RoBERTa

Harsh Rathva, Pruthwik Mishra, Shrikant Malviya

TL;DR

The paper addresses multilingual hallucination detection in scientific text by tackling data scarcity through a data-centric approach. It unifies five existing datasets into a large, balanced corpus of 124,821 samples and fine-tunes XLM-RoBERTa-Large, achieving competitive results across 9 languages and a 2nd-place finish for Gujarati in zero-shot settings. Key contributions include the dataset unification and balancing strategy, open release of data and code, and a demonstration that data quality and distribution can outweigh architectural complexity. A notable gap between validation and competition performance is attributed to distribution shifts and domain mismatch, guiding future work toward domain adaptation and targeted data generation. These findings underscore data-centric AI as a practical path to robust multilingual hallucination detection in low-resource settings.

Abstract

The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches, which focus primarily on model architecture, we adopted a data-centric strategy that addresses the critical issue of training data scarcity and imbalance. We unified and balanced five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. We fine-tuned XLM-RoBERTa-Large (560 million parameters) on this enhanced dataset and achieved competitive performance across all languages, including 2nd place in Gujarati (a zero-shot language) with a Factuality F1 of 0.5107, and rankings between 4th and 6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.
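
As an illustration of the balancing step described in the abstract, the following is a minimal sketch of how a unified corpus might be brought to a 50/50 label split. The abstract does not say how the balance was reached, so downsampling the larger class is an assumption here, and the function name `balance_binary`, the `label` field, and the toy datasets are all hypothetical.

```python
import random
from collections import defaultdict


def balance_binary(samples, label_key="label", seed=0):
    """Downsample so every label has as many samples as the rarest label.

    Assumption: balancing by downsampling; the paper may have used
    a different strategy (e.g. oversampling or targeted collection).
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[label_key]].append(sample)

    # Size of the smallest class determines the per-class quota.
    quota = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, quota))
    rng.shuffle(balanced)
    return balanced


# Toy stand-ins for two source datasets (0 = correct, 1 = hallucinated).
dataset_a = [{"text": f"a{i}", "label": 0} for i in range(6)]
dataset_a += [{"text": f"a{i}", "label": 1} for i in range(6, 8)]
dataset_b = [{"text": f"b{i}", "label": 1} for i in range(3)]

unified = dataset_a + dataset_b      # 6 correct, 5 hallucinated
balanced = balance_binary(unified)   # 5 of each label
```

Downsampling discards data from the larger class, so a real pipeline would need to weigh that loss against the benefit of a balanced label distribution.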

Paper Structure

This paper contains 14 sections, 6 tables.