HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation

Loris Bergeron, Ioana Buhnila, Jérôme François, Radu State

TL;DR

HalluGuard introduces a 4B-parameter Small Reasoning Model (SRM) designed to mitigate hallucinations in Retrieval-Augmented Generation by producing evidence-grounded justifications. The approach combines a domain-agnostic synthetic data pipeline (HalluClaim) with multi-stage curation, prompt-guided data reformulation, and a two-generator preference setup (LoRA fine-tuning of a Qwen3-4B backbone with ORPO) to distill large-model reasoning into a compact model. On the LLM-AggreFact benchmark, HalluGuard-4B achieves 75.7% balanced accuracy (BAcc), rivaling larger models, and on RAGTruth it reaches 84.0% BAcc with high true negative and true positive rates. Ablation studies highlight the roles of reasoning traces, consensus filtering, and preference alignment in this performance, while the model's evidence-backed justifications keep its decisions interpretable for enterprise RAG deployments. The work also emphasizes reproducibility and plans an open release of HalluGuard and related datasets under Apache 2.0.

Abstract

Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization (ORPO) to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling the specialized models MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, on par with larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
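
For readers unfamiliar with the metric, balanced accuracy is conventionally the mean of the true positive rate and the true negative rate, which keeps a classifier from scoring well by always predicting the majority class. The following is a minimal sketch of that standard definition, not the benchmark's evaluation code. The function name, the label strings, and the choice of "hallucinated" as the positive class are assumptions made for illustration; the metric is symmetric, so the choice of positive class does not change the result.

```python
def balanced_accuracy(y_true, y_pred, positive="hallucinated"):
    """Mean of true positive rate and true negative rate for binary labels.

    `positive` names the class treated as positive (an illustrative
    assumption; swapping the classes gives the same value).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return (tpr + tnr) / 2


# Toy labels, not data from the paper.
truth = ["grounded", "grounded", "hallucinated", "hallucinated"]
preds = ["grounded", "hallucinated", "hallucinated", "hallucinated"]
score = balanced_accuracy(truth, preds)
```

In this toy case both hallucinated claims are caught (TPR = 1.0) but only one of the two grounded claims is accepted (TNR = 0.5), so the score is 0.75.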

Paper Structure

This paper contains 49 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: HalluGuard Concept. Given a document $x$ and a claim $c$, the model first thinks before classifying their relationship as grounded or hallucinated, and then produces a justification citing relevant parts of $x$.
  • Figure 2: Examples of Relations. A grounded claim, an intrinsic hallucination, and an extrinsic hallucination.
  • Figure 3: HalluGuard Training Pipeline. A domain-agnostic corpus is filtered, reformed, and used to generate three types of synthetic claims (grounded, intrinsic hallucinated, and extrinsic hallucinated). Preference data are built via cross-model generation (Qwen3-32B and Qwen3-0.6B); model-agreement verification and LLM-based consensus filtering then enhance quality and confidence. The Qwen3-4B backbone is then fine-tuned using LoRA and ORPO to mitigate hallucinations and produce evidence-grounded justifications in RAG applications.
  • Figure 4: Ablation of HalluGuard-4B. Comparison of the full model and three variants on LLM-AggreFact.
  • Figure 5: Effect of Model Reasoning. Radar plot comparing HalluGuard in think mode (lighter blue) vs. in /no_think mode (darker blue).
  • ...and 1 more figure
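
The input/output contract described for Figure 1 (a document $x$ and a claim $c$ go in; a reasoning trace, a grounded/hallucinated label, and a justification come out) can be sketched as a small data structure plus a parser. This is only an illustration under stated assumptions: the `Verdict` class, the `parse_output` function, and the "label on the first line, justification after it" layout are hypothetical and are not taken from the paper. The `<think>...</think>` delimiters follow the convention Qwen3 models use for reasoning traces.

```python
import re
from dataclasses import dataclass

LABELS = ("grounded", "hallucinated")

# Qwen3-style reasoning block; absent when the model runs in /no_think mode.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


@dataclass
class Verdict:
    reasoning: str      # content of the think trace, "" if none was emitted
    label: str          # "grounded" or "hallucinated"
    justification: str  # text citing the relevant parts of the document


def parse_output(raw: str) -> Verdict:
    """Split raw model text into reasoning, label, and justification.

    Hypothetical layout (an assumption, not the paper's format): an
    optional think block, then the label on its own line, then the
    justification.
    """
    match = _THINK.search(raw)
    reasoning = match.group(1).strip() if match else ""
    rest = _THINK.sub("", raw).strip()
    first_line, _, tail = rest.partition("\n")
    label = first_line.strip().lower()
    if label not in LABELS:
        raise ValueError(f"unexpected label: {first_line!r}")
    return Verdict(reasoning, label, tail.strip())


# Invented example text, not model output from the paper.
raw = (
    "<think>The document says 2019, the claim says 2021.</think>\n"
    "hallucinated\n"
    "The document states the launch was in 2019."
)
verdict = parse_output(raw)
```

Keeping the reasoning trace separate from the justification mirrors the distinction the figure captions draw between think mode and /no_think mode: the justification remains available even when no trace is produced.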