Face4RAG: Factual Consistency Evaluation for Retrieval Augmented Generation in Chinese

Yunqi Xu, Tianchi Cai, Jiyan Jiang, Xierui Song

TL;DR

The paper addresses factual inconsistency in Retrieval Augmented Generation (RAG) by introducing Face4RAG, an LLM-agnostic Chinese FCE benchmark comprising a nine-type error typology and both synthetic and real-world datasets. It reveals that logical fallacy errors are under-detected by existing FCE methods and proposes L-Face4RAG, which combines logic-preserving answer decomposition with fact-logic FCE and uses chain-of-thought prompts to improve evaluation. Extensive experiments show substantial accuracy gains on Face4RAG across synthetic and real-world data, as well as strong generalization to English FCE benchmarks, establishing broad applicability. The work provides valuable resources for robust factuality evaluation in RAG and highlights the importance of incorporating logical structure awareness into FCE methods.

Abstract

The prevailing issue of factual inconsistency errors in conventional Retrieval Augmented Generation (RAG) motivates the study of Factual Consistency Evaluation (FCE). Despite the various FCE methods proposed earlier, these methods are evaluated on datasets generated by specific Large Language Models (LLMs). Without a comprehensive benchmark, it remains unexplored how these FCE methods perform on other LLMs with different error distributions or even unseen error types, as these methods may fail to detect the error types generated by other LLMs. To fill this gap, in this paper, we propose the first comprehensive FCE benchmark Face4RAG for RAG independent of the underlying LLM. Our benchmark consists of a synthetic dataset built upon a carefully designed typology for factuality inconsistency error and a real-world dataset constructed from six commonly used LLMs, enabling evaluation of FCE methods on specific error types or real-world error distributions. On the proposed benchmark, we discover the failure of existing FCE methods to detect the logical fallacy, which refers to a mismatch of logic structures between the answer and the retrieved reference. To fix this issue, we further propose a new method called L-Face4RAG with two novel designs of logic-preserving answer decomposition and fact-logic FCE. Extensive experiments show L-Face4RAG substantially outperforms previous methods for factual inconsistency detection on a wide range of tasks, notably beyond the RAG task from which it is originally motivated. Both the benchmark and our proposed method are publicly available at https://huggingface.co/datasets/yq27/Face4RAG.
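
To make the idea of evaluating an FCE method on specific error types concrete, the following is a minimal sketch and not the authors' evaluation code. The field names (`answer`, `reference`, `label`, `error_type`), the `contradiction` and `none` labels, the three English toy samples, and the word-overlap predictor are all assumptions made for illustration; the actual benchmark is in Chinese and uses a nine-type typology that is not reproduced here. The toy predictor checks only whether every word of the answer occurs in the reference, so it has no notion of logical structure.

```python
import re
from collections import defaultdict


def accuracy_by_error_type(samples, predict):
    """Return {error_type: accuracy} for a binary consistency predictor.

    samples: iterable of dicts with keys "answer", "reference",
             "label" (True = factually consistent) and "error_type".
    predict: callable (answer, reference) -> bool.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s["error_type"]] += 1
        if predict(s["answer"], s["reference"]) == s["label"]:
            correct[s["error_type"]] += 1
    return {t: correct[t] / total[t] for t in total}


def word_overlap_predictor(answer, reference):
    """Hypothetical fact-only baseline: consistent iff every answer word
    also appears in the reference. It ignores how the words are related."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    reference_words = set(re.findall(r"\w+", reference.lower()))
    return answer_words <= reference_words


# Hypothetical toy samples (not taken from the benchmark).
samples = [
    {   # Same words, reversed causality: a logic-structure mismatch.
        "reference": "Heavy rain caused the flood.",
        "answer": "The flood caused heavy rain.",
        "label": False,
        "error_type": "logical_fallacy",
    },
    {   # Faithful answer.
        "reference": "The museum opened in 1990.",
        "answer": "The museum opened in 1990.",
        "label": True,
        "error_type": "none",
    },
    {   # Wrong year: a plain factual contradiction.
        "reference": "The museum opened in 1990.",
        "answer": "The museum opened in 1985.",
        "label": False,
        "error_type": "contradiction",
    },
]

result = accuracy_by_error_type(samples, word_overlap_predictor)
print(result)
```

In this toy setup the overlap baseline scores 1.0 on the faithful and contradiction samples but 0.0 on the reversed-causality sample, which mirrors, in a highly simplified way, why a per-error-type breakdown can expose weaknesses that an aggregate score hides.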

Paper Structure (18 sections, 5 figures, 11 tables)


Figures (5)

  • Figure 1: An overview of our proposed FCE benchmark and method, in comparison with prior works. The upper left plot gives an example from the RAG task. The lower left plot illustrates a previous FCE method, and the lower middle plot depicts our proposed FCE method L-Face4RAG. The upper right plot shows the procedure for constructing the real-world dataset in our proposed Face4RAG benchmark, which follows the procedure of previous benchmarks. The lower right plot illustrates the construction of the synthetic dataset in the Face4RAG benchmark.
  • Figure 2: Error type distribution of factually inconsistent samples in the two datasets of our Face4RAG benchmark.
  • Figure 3: Error type distributions of the six LLMs in our real-world dataset (we omit the 50% to 100% region in type ratio).
  • Figure 4: A few examples for our proposed logic-preserving answer decomposition.
  • Figure 5: The process of logic consistency evaluation.