Evaluating Evidence Attribution in Generated Fact Checking Explanations

Rui Xing, Timothy Baldwin, Jey Han Lau

TL;DR

This work tackles the trustworthiness of generated explanations in automated fact-checking by focusing on evidence attribution. It introduces a citation masking and recovery protocol to evaluate how accurately explanations attribute statements to cited evidence, and validates it with human annotators and automated methods. The study shows that while large language models (LLMs) can align with human judgments, even the best models produce imperfect attributions, and human-curated evidence substantially improves attribution quality. Collectively, the findings highlight the need for careful evidence selection and point to promising directions for making fact-checking explanations more transparent and trustworthy, including leveraging LLM-based annotators that closely track human judgments.

Abstract

Automated fact-checking systems often struggle with trustworthiness, as their generated explanations can include hallucinations. In this work, we explore evidence attribution for fact-checking explanation generation. We introduce a novel evaluation protocol -- citation masking and recovery -- to assess attribution quality in generated explanations. We implement our protocol using both human annotators and automatic annotators, and find that LLM annotation correlates with human annotation, suggesting that attribution assessment can be automated. Finally, our experiments reveal that: (1) the best-performing LLMs still generate explanations with inaccurate attributions; and (2) human-curated evidence is essential for generating better explanations. Code and data are available here: https://github.com/ruixing76/Transparent-FCExp.

Paper Structure (38 sections, 5 equations, 11 figures, 14 tables)

Figures (11)

  • Figure 1: Automated explanation generation in fact-checking. Given a claim, its veracity and a list of evidence passages, a subset of these passages is selected, either by humans or machines, and input into a large language model (LLM) along with the claim to generate the explanation.
  • Figure 2: Our human assessment protocol: citation masking and citation recovery. Given the generated explanation and a list of evidence passages, we randomly select one sentence $e_k$ and mask its inline citation marker. Annotators are then required to perform citation recovery, i.e. to predict the masked citation.
  • Figure 3: The inter-annotator agreement for the human annotations. "Evi Src" indicates whether the evidence is human- or machine-selected.
  • Figure 4: The agreement between automatic and human annotators. "Generator" = models that are also used as (explanation) generators; "$\sim$7b LLM" refers to open-source LLMs with around 7B parameters; "NLI PLM" indicates PLMs used to conduct pairwise NLI, as described in \ref{para:pairwise_nli}.
  • Figure 5: Annotation Interface Page 1
  • ...and 6 more figures
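
The citation masking and recovery protocol described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes inline citations look like "[3]", masks only the first marker in the chosen sentence, and scores recovery as simple accuracy. The names `mask_citation` and `recovery_accuracy` are hypothetical.

```python
import random
import re

# Assumption: inline citation markers have the form "[3]".
# The source does not specify the marker format.
CITATION = re.compile(r"\[(\d+)\]")


def mask_citation(sentences, rng):
    """Randomly pick one cited sentence and mask its first citation marker.

    Returns the masked explanation, the index k of the chosen sentence,
    and the gold evidence id that was hidden.
    """
    cited = [i for i, s in enumerate(sentences) if CITATION.search(s)]
    if not cited:
        raise ValueError("no sentence carries a citation marker")
    k = rng.choice(cited)
    match = CITATION.search(sentences[k])
    gold = int(match.group(1))
    masked = list(sentences)
    masked[k] = sentences[k][:match.start()] + "[MASK]" + sentences[k][match.end():]
    return masked, k, gold


def recovery_accuracy(gold_ids, predicted_ids):
    """Fraction of masked citations that an annotator recovered correctly."""
    if len(gold_ids) != len(predicted_ids):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_ids:
        raise ValueError("need at least one masked citation")
    correct = sum(g == p for g, p in zip(gold_ids, predicted_ids))
    return correct / len(gold_ids)


# Toy explanation: three sentences, two of which cite evidence passages.
explanation = [
    "The claim refers to a 2019 report [1].",
    "The report was widely discussed.",
    "A later audit contradicts the figure [2].",
]
masked, k, gold = mask_citation(explanation, random.Random(0))
```

An annotator, human or automatic, would then see `masked` together with the evidence passages and predict which passage the `[MASK]` refers to. Comparing those predictions against `gold` over many explanations gives a rough measure of how well the explanation's statements are attributable to the cited evidence.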