Teaching with Lies: Curriculum DPO on Synthetic Negatives for Hallucination Detection

Shrey Pandit, Ashwin Vinod, Liu Leqi, Ying Ding

TL;DR

This work introduces HaluCheck, a curriculum-guided Direct Preference Optimization (DPO) framework that uses carefully curated hallucinated negatives to train LLMs for hallucination detection. By ranking negatives with MiniCheck grounding scores and progressively exposing the model to harder cases, HaluCheck reaches performance competitive with the state of the art at 1–3B scales on MedHallu and HaluEval, while maintaining strong zero-shot robustness on external benchmarks. The approach significantly outperforms baselines trained on standard negative samples, as well as larger models, with notable gains on challenging datasets. The results demonstrate the practical potential of curriculum-based alignment for reliable hallucination detection, particularly where computational efficiency matters. The work also acknowledges its dependence on external verifiers and the importance of human oversight in high-stakes use cases.

Abstract

Aligning large language models (LLMs) to accurately detect hallucinations remains a significant challenge due to the sophisticated nature of hallucinated text. Recognizing that hallucinated samples typically exhibit higher deceptive quality than traditional negative samples, we use these carefully engineered hallucinations as negative examples in the DPO alignment procedure. Our method incorporates a curriculum learning strategy, gradually transitioning the training from easier samples, identified by the greatest reduction in probability scores from independent fact-checking models, to progressively harder ones. This structured difficulty scaling ensures stable and incremental learning. Experimental evaluation demonstrates that our HaluCheck models, trained with a curriculum DPO approach and high-quality negative samples, significantly improve performance across various metrics, achieving improvements of up to 24% on difficult benchmarks like MedHallu and HaluEval. Additionally, HaluCheck models demonstrate robustness in zero-shot settings, significantly outperforming larger state-of-the-art models across various benchmarks.

Paper Structure

This paper contains 29 sections, 3 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of the qualitative difference between standard negative samples used in conventional DPO alignment and our proposed method, which leverages carefully curated hallucinated answers as high-quality negative examples in DPO alignment.
  • Figure 2: Pipeline for selecting high-quality hallucinated negatives for Direct Preference Optimization (DPO). Each question and context is paired with a hallucinated answer, scored for grounded factuality via MiniCheck, and then ranked by difficulty. In each batch, gold references (chosen) and top-ranked hallucinations (rejected) form preference pairs. These pairs optimize the DPO objective, ensuring training against vetted, high-quality negatives rather than arbitrary failures.
  • Figure 3: Grounded factuality scores of the hallucinated samples from the MedHallu dataset. Only samples with a score above 0.25 are kept.
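
The selection pipeline described in the Figure 2 and Figure 3 captions can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` class, the function names, the number of curriculum stages, and the prompt template are all hypothetical, and the grounding scores are assumed to have been computed beforehand by a verifier such as MiniCheck. The sketch also assumes that a lower grounding score marks an easier negative (the verifier rejects it clearly), so stages run from low to high score. The loss function is the standard per-pair DPO loss, with `beta=0.1` chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    context: str
    gold_answer: str          # used as the "chosen" response
    hallucinated_answer: str  # used as the "rejected" response
    grounding_score: float    # assumed precomputed by a verifier such as MiniCheck


def build_curriculum(samples, min_score=0.25, num_stages=3):
    """Keep negatives scoring above min_score, sort easy -> hard, split into stages."""
    kept = [s for s in samples if s.grounding_score > min_score]
    if not kept:
        return []
    # Assumption: a higher grounding score means a more convincing, harder negative.
    kept.sort(key=lambda s: s.grounding_score)
    stage_size = math.ceil(len(kept) / num_stages)
    return [kept[i:i + stage_size] for i in range(0, len(kept), stage_size)]


def to_preference_pair(sample):
    """Format one sample as a (prompt, chosen, rejected) preference pair."""
    prompt = f"Context: {sample.context}\nQuestion: {sample.question}"
    return {
        "prompt": prompt,
        "chosen": sample.gold_answer,
        "rejected": sample.hallucinated_answer,
    }


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Toy data with made-up scores, purely to show the control flow.
    toy = [
        Sample("q1", "c1", "gold1", "hallu1", 0.10),
        Sample("q2", "c2", "gold2", "hallu2", 0.90),
        Sample("q3", "c3", "gold3", "hallu3", 0.30),
        Sample("q4", "c4", "gold4", "hallu4", 0.60),
    ]
    for stage_idx, stage in enumerate(build_curriculum(toy)):
        pairs = [to_preference_pair(s) for s in stage]
        print(stage_idx, [s.grounding_score for s in stage], len(pairs))
```

In an actual training run, the per-stage pairs would be passed to a DPO trainer in stage order, with the log-probabilities coming from the policy and a frozen reference model rather than being supplied by hand.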