Enhancing Hallucination Detection through Perturbation-Based Synthetic Data Generation in System Responses
Dongxu Zhang, Varun Gangal, Barrett Martin Lattimer, Yi Yang
TL;DR
This work addresses the cost of training hallucination detectors for rapidly evolving LLMs by automatically generating both faithful and hallucinated outputs through a prompt-driven rewriting pipeline. A rewriting LLM (GPT-4) perturbs the target system's responses, producing a synthetic dataset that closely matches real-world LLM outputs and supports end-to-end training of a T5-base detector. On OpenDialKG-Eval and BEGIN, the fine-tuned detector achieves higher Macro-F1 and lower latency than zero-shot methods and prior synthetic baselines, while also reducing annotation costs. The generated data covers a richer range of hallucinations and shows favorable quality, and ablations indicate that both faithful and hallucinated data are needed. Overall, the method offers a cost-effective, adaptable path to improving the reliability and safety of LLM-based systems.
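Since the comparison above is reported in Macro-F1, a minimal sketch of that metric may help: it is the unweighted mean of the per-class F1 scores, so the faithful and hallucinated classes count equally regardless of how many examples each has. The function name `macro_f1` and the label strings are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one example is required")

    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return sum(f1_scores) / len(f1_scores)
```

For a binary faithful/hallucinated detector this averages exactly two F1 scores, which keeps a majority class from dominating the result.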
Abstract
Detecting hallucinations in large language model (LLM) outputs is pivotal, yet traditional fine-tuning for this classification task is impeded by an annotation process that is expensive and quickly outdated, especially across numerous vertical domains and in the face of rapid LLM advancements. In this study, we introduce an approach that automatically generates both faithful and hallucinated outputs by rewriting system responses. Experimental findings demonstrate that a T5-base model fine-tuned on our generated dataset outperforms state-of-the-art zero-shot detectors and existing synthetic generation methods in both accuracy and latency, indicating the efficacy of our approach.
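The rewriting idea can be sketched roughly as follows: each (knowledge, system response) pair is passed to a rewriting model twice, once with a faithful-rewrite prompt and once with a hallucination-inducing prompt, and each rewrite is stored with its label for detector training. This is only an illustration under stated assumptions: the prompt wording, the `Example` class, the `rewrite` callable (standing in for a call to the rewriting LLM), and the `to_t5_input` text format are all hypothetical and not the authors' actual prompts or code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
FAITHFUL_PROMPT = (
    "Rewrite the response so that it stays fully supported by the knowledge.\n"
    "Knowledge: {knowledge}\nResponse: {response}\nRewrite:"
)
HALLUCINATED_PROMPT = (
    "Rewrite the response so that it contains a claim the knowledge does not support.\n"
    "Knowledge: {knowledge}\nResponse: {response}\nRewrite:"
)


@dataclass
class Example:
    knowledge: str
    response: str
    label: str  # "faithful" or "hallucinated"


def build_synthetic_dataset(
    pairs: Iterable[Tuple[str, str]],
    rewrite: Callable[[str], str],
) -> List[Example]:
    """Create one faithful and one hallucinated rewrite per system response.

    `rewrite` is an assumed interface: it takes a prompt string and returns
    the rewriting model's output text.
    """
    dataset: List[Example] = []
    for knowledge, response in pairs:
        for label, template in (
            ("faithful", FAITHFUL_PROMPT),
            ("hallucinated", HALLUCINATED_PROMPT),
        ):
            prompt = template.format(knowledge=knowledge, response=response)
            dataset.append(Example(knowledge, rewrite(prompt), label))
    return dataset


def to_t5_input(example: Example) -> Tuple[str, str]:
    """Format an example as a (source, target) text pair for a text-to-text model."""
    source = f"knowledge: {example.knowledge} response: {example.response}"
    return source, example.label
```

Passing the rewriter in as a callable keeps the sketch independent of any particular LLM client, and the resulting (source, target) pairs could then be fed to a standard sequence-to-sequence fine-tuning loop.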
