C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation
Xu Zhang, Zhifei Liu, Jiahao Wang, Huixuan Zhang, Fan Xu, Junzhe Zhang, Xiaojun Wan
TL;DR
This paper tackles the challenge of evaluating hallucinations in Chinese LLMs by introducing HaluAgent, an automated, agentic pipeline that generates fine-grained QA data from knowledge documents and verifies its correctness with explicit rules. Building on HaluAgent, the authors present C-FAITH, a Chinese hallucination benchmark derived from 1,399 documents and comprising multiple QA formats (generative, single-choice, true/false), totaling 60,702 entries. Their experiments show that prompt optimization and rule-based verification improve the quality of the generated data, and the benchmark reveals that larger models generally hallucinate less, though hallucination type and input content significantly influence hallucination rates. C-FAITH enables scalable, fine-grained evaluation of hallucination across domains, which is useful for model development, safety assessment, and benchmarking in non-English settings.
Abstract
Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese) rely on human annotation, making automatic and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA datasets from knowledge documents. Our experiments demonstrate that manually designed rules and prompt optimization improve the quality of the generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark built from 1,399 knowledge documents obtained by web scraping and totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs on C-FAITH and provide detailed experimental results and analysis.
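To make the idea of rule-based verification concrete, the following is a minimal illustrative sketch, not HaluAgent's actual implementation. The `QAItem` type, the `verify` function, and the specific rules are all hypothetical; they only show how generated items in the three QA formats might be screened with explicit structural checks before entering a benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class QAItem:
    # Hypothetical record for one generated item (not the paper's schema).
    kind: str      # "generative", "single_choice", or "true_false"
    question: str
    answer: str
    options: list[str] = field(default_factory=list)


def verify(item: QAItem) -> list[str]:
    """Return the rules the item violates; an empty list means it passes."""
    problems = []
    if not item.question.strip():
        problems.append("empty question")
    if not item.answer.strip():
        problems.append("empty answer")
    if item.kind == "single_choice":
        # Assumed rule: options are distinct and exactly one equals the answer.
        if len(set(item.options)) != len(item.options):
            problems.append("duplicate options")
        if item.options.count(item.answer) != 1:
            problems.append("answer must match exactly one option")
    elif item.kind == "true_false":
        # Assumed rule: the label is one of two fixed strings.
        if item.answer not in {"true", "false"}:
            problems.append("answer must be 'true' or 'false'")
    elif item.kind != "generative":
        problems.append("unknown item kind")
    return problems
```

In a pipeline of this shape, items with a non-empty result from `verify` would be dropped or sent back for regeneration. Checks of this kind cover only structural validity; factual correctness against the source document would need a separate step.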
