CiteCheck: Towards Accurate Citation Faithfulness Detection

Ziyao Xu, Shaohang Wei, Zhuoheng Han, Jing Jin, Zhe Yang, Xiaoguang Li, Haochen Tan, Zhijiang Guo, Houfeng Wang

TL;DR

CiteCheck tackles the challenge of citation faithfulness detection in Chinese RAG systems by introducing the first large-scale Chinese dataset for the task, built via a cost-efficient two-stage annotation workflow. The authors combine question collection from diverse sources, GPT-4o-assisted data augmentation to generate high-quality negative samples, and careful manual validation to produce a balanced training set and a challenging test set. Zero-shot evaluations show that state-of-the-art LLMs struggle to detect unsupported citations, while smaller models trained with parameter-efficient fine-tuning achieve strong performance thanks to the augmented training data. This work provides a practical foundation for reliable, citation-grounded Chinese RAG applications and offers a scalable methodology for dataset construction in low-resource language settings.

Abstract

Citation faithfulness detection is critical for enhancing retrieval-augmented generation (RAG) systems, yet large-scale Chinese datasets for this task are scarce. Existing methods face prohibitive costs due to the need for manually annotated negative samples. To address this, we introduce the first large-scale Chinese dataset CiteCheck for citation faithfulness detection, constructed via a cost-effective approach using two-stage manual annotation. This method balances positive and negative samples while significantly reducing annotation expenses. CiteCheck comprises training and test splits. Experiments demonstrate that: (1) the test samples are highly challenging, with even state-of-the-art LLMs failing to achieve high accuracy; and (2) training data augmented with LLM-generated negative samples enables smaller models to attain strong performance using parameter-efficient fine-tuning. CiteCheck provides a robust foundation for advancing citation faithfulness detection in Chinese RAG systems. The dataset is publicly available to facilitate research.
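To make the task concrete, the following is a minimal sketch of citation faithfulness detection framed as binary classification and scored by accuracy on a balanced set. The `Sample` fields, the `accuracy` helper, the `always_supported` baseline, and the toy English data are all assumptions made for illustration; they are not the CiteCheck schema, data, or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Sample:
    """One detection instance (hypothetical field names, not the dataset's schema)."""
    statement: str   # sentence taken from a RAG answer
    citation: str    # passage the statement cites
    supported: bool  # True = positive (faithful), False = negative (unsupported)


# A detector maps (statement, citation) to a predicted "supported" label.
Detector = Callable[[str, str], bool]


def accuracy(samples: Iterable[Sample], detector: Detector) -> float:
    """Fraction of samples whose predicted label matches the gold label."""
    samples = list(samples)
    if not samples:
        raise ValueError("accuracy is undefined for an empty sample set")
    correct = sum(detector(s.statement, s.citation) == s.supported for s in samples)
    return correct / len(samples)


def always_supported(statement: str, citation: str) -> bool:
    """Trivial baseline that trusts every citation."""
    return True


# Toy balanced set, invented for illustration: the negative sample reuses the
# positive one with a supporting detail deleted from the citation.
toy_samples = [
    Sample("The bridge opened in 1998.", "The bridge opened to traffic in 1998.", True),
    Sample("The bridge opened in 1998.", "The bridge opened to traffic.", False),
]

baseline_accuracy = accuracy(toy_samples, always_supported)
```

On a balanced set, a detector that trusts every citation is right on exactly the positive half, which is why a balanced test split makes accuracy a meaningful signal for this task.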

Paper Structure

This paper contains 16 sections, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Examples of the interfaces that present samples to the annotators. The first image shows an example from the first annotation stage. The last two images show the second stage, in which the same sample has been modified (information changed or deleted).