Multilingual Fine-Grained News Headline Hallucination Detection
Jiaming Shen, Tianqi Liu, Jialu Liu, Zhen Qin, Jay Pavagadhi, Simon Baumgartner, Michael Bendersky
TL;DR
This paper tackles hallucination in multilingual news headlines by introducing MFHHD, the first dataset to provide fine-grained, language-aware entailment annotations for 11,469 article-headline pairs across five languages. It demonstrates that supervised fine-tuning benefits both from natural language inference (NLI) pretraining and from incorporating explanations, with the best model (mT5_xxl + NLI + Exp) reaching around 74% coarse accuracy and about 67% Example-F1 for fine-grained detection. In the few-shot setting, the authors propose language-dependent demonstration selection and coarse-to-fine prompting to improve in-context learning. Both techniques improve PaLM2-L and GPT-4, though the few-shot results still lag behind the best supervised approaches. The work contributes a dataset and practical findings for multilingual, fine-grained headline hallucination detection, with implications for improving the faithfulness of automated headlines across languages.
Abstract
The popularity of automated news headline generation has surged with advancements in pre-trained language models. However, these models often suffer from the "hallucination" problem, where the generated headline is not fully supported by its source article. Efforts to address this issue have predominantly focused on English, using over-simplistic classification schemes that overlook nuanced hallucination types. In this study, we introduce the first multilingual, fine-grained news headline hallucination detection dataset that contains over 11 thousand pairs in 5 languages, each annotated with detailed hallucination types by experts. We conduct extensive experiments on this dataset under two settings. First, we implement several supervised fine-tuning approaches as preparatory solutions and demonstrate this dataset's challenges and utilities. Second, we test various large language models' in-context learning abilities and propose two novel techniques, language-dependent demonstration selection and coarse-to-fine prompting, to boost the few-shot hallucination detection performance in terms of the example-F1 metric. We release this dataset to foster further research in multilingual, fine-grained headline hallucination detection.
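The abstract reports few-shot performance with the example-F1 metric but does not define it here. The sketch below shows the standard example-based F1 used in multi-label classification: an F1 score is computed per example between the gold and predicted label sets, and the scores are then averaged. It assumes this is the intended definition, which this section does not confirm. The function name `example_f1`, the label strings, and the choice to score two empty sets as 1.0 are all illustrative assumptions, not taken from the paper.

```python
from typing import List, Set


def example_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Example-based F1: mean of per-example F1 between label sets.

    Assumption: this is the usual multi-label definition, not
    necessarily the exact formulation used by the authors.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0

    total = 0.0
    for g, p in zip(gold, pred):
        if not g and not p:
            # Assumption: an example with no gold and no predicted
            # labels counts as a perfect match.
            total += 1.0
            continue
        overlap = len(g & p)
        # F1 for sets: 2 * |G ∩ P| / (|G| + |P|)
        total += 2.0 * overlap / (len(g) + len(p))
    return total / len(gold)


# Hypothetical hallucination-type labels, for illustration only.
gold_labels = [{"type_a", "type_b"}, {"type_c"}]
pred_labels = [{"type_a"}, {"type_c"}]
score = example_f1(gold_labels, pred_labels)
print(f"example-F1 = {score:.4f}")
```

Unlike micro- or macro-averaged F1, this metric weights every article-headline pair equally, so a single pair with many labels cannot dominate the score.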
