FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data

Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng

TL;DR

This work introduces CG2C, a graph-based synthetic data approach that enables controlled, multi-hop reasoning for factuality detection in LLM outputs. The FactCG fact-checker, trained with CG2C-generated data from both multi-hop QA and documents, achieves state-of-the-art results among models of comparable size and even surpasses GPT-4-o on the LLM-AggreFact benchmark in some settings. By avoiding LLM-based label generation, FactCG attains strong performance with smaller models and exhibits more connected reasoning than artifact-based baselines. The study also provides a connected reasoning evaluation (CoRe) showing improved, but not perfect, multi-sentence integration, and discusses limitations in graph extraction, data scale, and chunking strategies with clear directions for future work.

Abstract

Prior research on training grounded factuality classification models to detect hallucinations in large language models (LLMs) has relied on public natural language inference (NLI) data and synthetic data. However, conventional NLI datasets are not well suited to document-level reasoning, which is critical for detecting LLM hallucinations. Recent approaches to document-level synthetic data generation iteratively remove sentences from documents and annotate factuality using LLM-based prompts. While effective, this method is computationally expensive for long documents and limited by the LLM's capabilities. In this work, we analyze the differences between the synthetic training data used in state-of-the-art models and real LLM output claims. Based on our findings, we propose a novel approach to synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact-checker model, FactCG, demonstrates improved performance with more connected reasoning when using the same backbone models. Experiments show that it even outperforms GPT-4-o on the LLM-AggreFact benchmark with a much smaller model size.

Paper Structure

This paper contains 34 sections, 4 equations, 2 figures, and 17 tables.

Figures (2)

  • Figure 1: Context Graphs in LLM Responses. Red nodes represent contextualized entities that require further bridging for coreference resolution.
  • Figure 2: CG2C from Document. To generate synthetic data from documents only, we first construct a context graph $\mathcal{G}$ from the document $doc$. Second, we extract a multi-hop sub-graph $\mathcal{G}_c$ from the context graph. Third, we generate a claim $c$ from the context $\mathcal{G}_c$. To corrupt the document, we remove a random relation between entities within $\mathcal{G}_c$ to get $doc_{neg}$. Finally, we obtain the positive sample $\langle doc, c \rangle$ and the negative sample $\langle doc_{neg}, c \rangle$. For CG2C-MHQA, we get $c$ from $\langle q, ans \rangle$ and use $\langle \mathcal{G}, c \rangle$ to get $\mathcal{G}_c$ instead of performing steps 2 and 3 above.
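
The document-only pipeline in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the context graph is taken as already extracted and given as a list of triples, each triple carries a hypothetical `sent_idx` linking it to the sentence that states it, the claim is produced by a simple template (`verbalize`) standing in for whatever generator the paper uses, and "removing a relation" is approximated by dropping that sentence from the document. All names (`Triple`, `sample_multihop_subgraph`, `verbalize`, `make_pair`) are made up for this sketch.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str
    sent_idx: int  # assumption: index of the sentence expressing this relation


def sample_multihop_subgraph(triples, hops, rng):
    """Random walk of exactly `hops` edges; returns the path or None."""
    out_edges = defaultdict(list)
    for t in triples:
        out_edges[t.head].append(t)
    starts = sorted(out_edges)
    rng.shuffle(starts)
    for start in starts:
        path, node, seen = [], start, {start}
        while len(path) < hops:
            # Only follow edges to entities not yet visited (no cycles).
            options = [t for t in out_edges.get(node, []) if t.tail not in seen]
            if not options:
                break
            step = rng.choice(options)
            path.append(step)
            node = step.tail
            seen.add(node)
        if len(path) == hops:
            return path
    return None


def verbalize(subgraph):
    """Placeholder claim generator: a template, not the paper's generator."""
    return "; ".join(f"{t.head} {t.relation} {t.tail}" for t in subgraph) + "."


def make_pair(sentences, triples, hops=2, seed=0):
    """Build <doc, c> (positive) and <doc_neg, c> (negative) samples."""
    rng = random.Random(seed)
    subgraph = sample_multihop_subgraph(triples, hops, rng)
    if subgraph is None:
        return None
    claim = verbalize(subgraph)
    dropped = rng.choice(subgraph)  # random relation inside the sub-graph
    doc = " ".join(sentences)
    # Assumption: corrupt the document by deleting the supporting sentence.
    doc_neg = " ".join(s for i, s in enumerate(sentences) if i != dropped.sent_idx)
    return {"positive": (doc, claim), "negative": (doc_neg, claim)}


sentences = ["Alice founded Acme.", "Acme is based in Oslo.", "Oslo is in Norway."]
triples = [
    Triple("Alice", "founded", "Acme", 0),
    Triple("Acme", "is based in", "Oslo", 1),
    Triple("Oslo", "is in", "Norway", 2),
]
pair = make_pair(sentences, triples, hops=2, seed=0)
```

Both samples share the same claim; only the document differs. The negative label therefore comes from the graph edit itself and not from a separate LLM annotation step, which is the property the TL;DR highlights.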