Few-Shot Optimized Framework for Hallucination Detection in Resource-Limited NLP Systems
Baraa Hikal, Ahmed Nasreldin, Ali Hamdi, Ammar Mohammed
TL;DR
This paper tackles hallucination detection in NLP under data scarcity by introducing DeepSeek Few-shot Optimization to generate high-quality weak labels, followed by data restructuring to suit generative-model training. The workflow includes iterative prompt refinement, a 30k-strong weak-label corpus, and LoRA-based fine-tuning of Mistral-7B-Instruct-v0.3, achieving strong downstream performance. An ensemble of seven fine-tuned checkpoints via majority voting delivers a test accuracy of 85.5% and secures the top position on the model-agnostic SHROOM track, demonstrating robustness and scalability in resource-limited settings. The results highlight the value of aligning data formats with model capabilities and leveraging weak supervision and ensemble strategies to push state-of-the-art in hallucination detection, with potential extensions to multilingual and cross-task applications.
Abstract
Hallucination detection in text generation remains an ongoing struggle for natural language processing (NLP) systems, frequently resulting in unreliable outputs in applications such as machine translation and definition modeling. Existing methods struggle with data scarcity and the limitations of unlabeled datasets, as highlighted by the SHROOM shared task at SemEval-2024. In this work, we propose a novel framework to address these challenges, introducing DeepSeek Few-shot optimization to enhance weak label generation through iterative prompt engineering. We achieved high-quality annotations that considerably enhanced the performance of downstream models by restructuring data to align with instruct generative models. We further fine-tuned the Mistral-7B-Instruct-v0.3 model on these optimized annotations, enabling it to accurately detect hallucinations in resource-limited settings. Combining this fine-tuned model with ensemble learning strategies, our approach achieved 85.5% accuracy on the test set, setting a new benchmark for the SHROOM task. This study demonstrates the effectiveness of data restructuring, few-shot optimization, and fine-tuning in building scalable and robust hallucination detection frameworks for resource-constrained NLP systems.
