C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

Xu Zhang, Zhifei Liu, Jiahao Wang, Huixuan Zhang, Fan Xu, Junzhe Zhang, Xiaojun Wan

TL;DR

This paper tackles the challenge of evaluating hallucinations in Chinese LLMs by introducing HaluAgent, an automated, agentic pipeline that generates fine-grained QA data from knowledge documents and verifies it against manually designed rules. Building on HaluAgent, the authors present C-FAITH, a Chinese hallucination benchmark derived from 1,399 documents and comprising three QA formats (generative, single-choice, true/false), totaling 60,702 entries. Prompt optimization and rule-based verification improve the quality of the generated data, and the benchmark results indicate that larger models generally hallucinate less, although hallucination type and input content significantly influence the rates. C-FAITH enables scalable, targeted evaluation of hallucination across domains, with practical uses in model development, safety assessment, and benchmarking in non-English settings.

Abstract

Despite the rapid advancement of large language models (LLMs), they remain highly susceptible to hallucination, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese) rely on human annotation, making automatic and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA datasets from knowledge documents. Our experiments demonstrate that manually designed rules and prompt optimization improve the quality of the generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark of 60,702 entries created from 1,399 knowledge documents obtained by web scraping. We comprehensively evaluate 16 mainstream LLMs on C-FAITH and provide detailed experimental results and analysis.

Paper Structure

This paper contains 19 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The total hallucination rates of 16 tested LLMs on C-FAITH.
  • Figure 2: An example of the created QA data. HaluAgent first generates metadata containing a question, the correct answer, hallucinated responses, and hallucination labels. The QA metadata is then aggregated into three different formats for hallucination evaluation. Both the Chinese QA data and its English translation are shown in the figure.
  • Figure 3: An illustration of the proposed HaluAgent framework. HaluAgent consists of three modules: the generation module, the verification module, and the optimization module. Using manually designed rules, HaluAgent first optimizes its prompts based on error feedback from the verification module. It then takes knowledge documents as input and generates fine-grained QA data for hallucination evaluation.
  • Figure 4: The prompt optimization case of spatiotemporal error definition in hallucination label generation.
  • Figure 5: The false negative and false positive error rates of LLMs when facing true-or-false questions.
  • ...and 2 more figures
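
The Figure 2 caption describes QA metadata (question, correct answer, hallucinated responses, hallucination labels) being aggregated into three evaluation formats. The following is a minimal sketch of how such an aggregation might look, not the authors' implementation. The class `QAMeta`, the three functions, and all dictionary keys are hypothetical names chosen for illustration; "spatiotemporal error" is a label mentioned in the Figure 4 caption, while "entity error" is an assumed placeholder.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QAMeta:
    """Hypothetical container for one QA metadata record."""
    question: str
    correct_answer: str
    # Each pair is (hallucinated response, hallucination label).
    hallucinated: List[Tuple[str, str]]


def to_generative(meta: QAMeta) -> Dict:
    # Open-ended format: the model answers freely; the reference is kept for scoring.
    return {"format": "generative", "question": meta.question,
            "reference": meta.correct_answer}


def to_single_choice(meta: QAMeta, rng: random.Random) -> Dict:
    # Mix the correct answer with the hallucinated responses and shuffle the options.
    options = [meta.correct_answer] + [resp for resp, _ in meta.hallucinated]
    rng.shuffle(options)
    letters = "ABCDEFGH"
    return {"format": "single_choice", "question": meta.question,
            "options": dict(zip(letters, options)),
            "answer": letters[options.index(meta.correct_answer)]}


def to_true_false(meta: QAMeta) -> List[Dict]:
    # One judgment item per candidate: the correct answer is True,
    # each hallucinated response is False and keeps its label.
    items = [{"format": "true_false", "question": meta.question,
              "candidate": meta.correct_answer, "label": True,
              "hallucination_type": None}]
    for resp, label in meta.hallucinated:
        items.append({"format": "true_false", "question": meta.question,
                      "candidate": resp, "label": False,
                      "hallucination_type": label})
    return items


if __name__ == "__main__":
    meta = QAMeta(
        question="In which year was the bridge completed?",
        correct_answer="1957",
        hallucinated=[("1975", "spatiotemporal error"),
                      ("It was never completed", "entity error")],
    )
    print(to_generative(meta))
    print(to_single_choice(meta, random.Random(0)))
    for item in to_true_false(meta):
        print(item)
```

Keeping the hallucination label on each false true/false item is one way to support per-type error breakdowns such as the false positive and false negative rates mentioned for Figure 5; whether the benchmark stores labels this way is an assumption.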