Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation

Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang

TL;DR

Bi'an tackles hallucination in Retrieval-Augmented Generation by introducing Bi'anBench, a bilingual EN/ZH benchmark, and lightweight judge models. It encompasses four RAG tasks and uses hallucination perturbation and counterfactual QA pipelines to generate 22,992 test cases, enabling rigorous evaluation. The models are trained with a two-stage regime (SFT with LoRA followed by DPO) on Qwen2.5 7B/14B and benefit from ensemble-based data construction, with the 14B Bi'an model approaching GPT-4o performance at a lower cost. The work also analyzes knowledge conflicts in counterfactual settings, demonstrates ablations showing the value of SFT, and discusses limitations, including sample loss and the exclusion of creative-writing tasks, with data and models to be released publicly.
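The second training stage named above, DPO, can be illustrated with the standard Direct Preference Optimization loss for a single preference pair. This is only a minimal sketch of the generic objective, not the authors' training code: the function name `dpo_loss`, its argument names, and the value `beta=0.1` are assumptions for illustration, and this summary does not state the paper's hyperparameters or reference policy.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model. `beta` is a
    hypothetical default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Illustrative log-probabilities only (not results from the paper).
loss_untrained = dpo_loss(-12.0, -12.0, -12.0, -12.0)
loss_improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss shrinks as the policy assigns relatively more probability to the chosen response than the reference does.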

Abstract

Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.

Paper Structure

This paper contains 25 sections, 2 figures, 22 tables.

Figures (2)

  • Figure 1: An overview of the Bi'an framework, including Bi'anBench and Bi'an Model. In Chinese mythology, Bi'an is the offspring of a dragon and a tiger, a mythical creature capable of discerning right from wrong, thus aligning with the scenario of RAG hallucination detection.
  • Figure 2: Ablation studies on the training phase of the Bi'an models.