
AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators

Jingwei Ni, Minjing Shi, Dominik Stammbach, Mrinmaya Sachan, Elliott Ash, Markus Leippold

TL;DR

This work tackles two key barriers in factual claim detection, conceptual inconsistency and costly annotation, by introducing a verifiability-based definition of factual claims and AFaCTA, an LLM-assisted annotation framework. AFaCTA uses three prompting steps (Direct Classification, Fact-Extraction CoT, and Reasoning with Debate), then aggregates their outputs by majority vote and uses the level of agreement among them (self-consistency) to calibrate how reliable each annotation is. Evaluated on PoliClaim, a corpus of U.S. political speeches spanning 25 years, GPT-4 AFaCTA achieves near-expert accuracy on perfectly consistent samples and can auto-label a substantial portion of the data, enabling effective classifier training and data augmentation; the results generalize to a social-media domain (CheckThat!-2021-dev). The findings indicate that high-quality, self-consistent LLM annotations can substitute for manual labeling in scalable fact-checking work, with practical implications for building large, reliable claim-detection resources that transfer across domains.

Abstract

With the rise of generative AI, automated fact-checking methods to combat misinformation are becoming more and more important. However, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. To address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. To address (2), we introduce AFaCTA (Automatic Factual Claim deTection Annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (LLMs). AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths. Extensive evaluation and experiments in the domain of political speech reveal that AFaCTA can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.
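
The aggregation idea described above (three reasoning paths, a majority vote, and agreement as a confidence signal) can be illustrated with a minimal sketch. It assumes each reasoning path has already been reduced to one boolean vote ("is this a verifiable factual claim?"); the names `Annotation` and `aggregate_votes` are hypothetical, and the equal weighting of the three votes is a simplifying assumption rather than the authors' exact scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    label: bool                 # majority label: True = factual claim
    agreeing_votes: int         # number of paths that voted for the majority label
    perfectly_consistent: bool  # True when all paths agree


def aggregate_votes(direct: bool, fact_extraction: bool, debate: bool) -> Annotation:
    """Combine one boolean vote per reasoning path by simple majority.

    Assumption: each path contributes exactly one equally weighted vote.
    """
    votes = [direct, fact_extraction, debate]
    positives = sum(votes)
    label = positives >= 2  # majority of three
    agreeing = positives if label else len(votes) - positives
    return Annotation(label, agreeing, agreeing == len(votes))


if __name__ == "__main__":
    # Two of three paths say "factual claim": majority label is True,
    # but the sample is not perfectly consistent.
    print(aggregate_votes(True, True, False))
```

Under this sketch, samples with `perfectly_consistent=True` would be the candidates for automatic labeling, while the remaining samples would be routed to expert review.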

Paper Structure (39 sections, 5 equations, 10 figures, 12 tables)


Figures (10)

  • Figure 1: AFaCTA Pipeline. All steps that need LLM prompting are annotated with the brain icon. Besides the target statement, a short context (if available) is also provided to help the model understand the statement.
  • Figure 2: Left: accuracy vs. self-consistency level over $11$ CoT calls. Self-consistency level $x$ means that $x$ CoTs agree on the label and $(11-x)$ disagree. Solid and dashed lines show the accuracy of the LLMs and of random guessing, respectively, on the subsets at each self-consistency level. Right: accuracy on the subset where all $x$ sampled CoTs agree vs. the number of sampled CoTs $x$. Note that the perfectly consistent subset shrinks as more CoTs are sampled.
  • Figure 3: The performance of fine-tuned RoBERTa on PoliClaim$_{test}$ when gradually adding training data of different quality. "- -" denotes GPT-4's performance aggregating three AFaCTA reasoning steps.
  • Figure 4: Performance when augmenting a limited amount of PoliClaim$_{gold}$ data (left: all 1936 samples; right: 500 samples) with extra data from PoliClaim$_{silver}$ and PoliClaim$_{bronze}$. Experiments augmenting 1000 and 1500 PoliClaim$_{gold}$ samples can be found in the appendix. "- -" denotes the performance without augmentation. G, S, and B denote gold, silver, and bronze PoliClaim, respectively.
  • Figure 5: In Figure 2, GPT-3.5's accuracy on the perfectly consistent set does not appear to converge with 11 voters. We therefore extend the number of CoTs to 19 and observe that the accuracy converges to 84.1%.
  • ...and 5 more figures
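
The self-consistency level described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: labels are taken to be binary, the "label" that CoTs agree on is taken to be the majority label, the number of sampled CoTs is assumed odd so that no ties occur, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def self_consistency_level(labels: list[bool]) -> tuple[bool, int]:
    """Return (majority label, number of CoTs agreeing with it)."""
    if not labels:
        raise ValueError("need at least one sampled CoT label")
    majority_label, level = Counter(labels).most_common(1)[0]
    return majority_label, level


def accuracy_by_level(samples: list[list[bool]], gold: list[bool]) -> dict[int, float]:
    """Group samples by self-consistency level and score the majority label.

    samples[i] holds the labels from all CoT calls for statement i;
    gold[i] is the reference label for statement i.
    """
    correct: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for labels, reference in zip(samples, gold):
        predicted, level = self_consistency_level(labels)
        total[level] += 1
        correct[level] += int(predicted == reference)
    return {level: correct[level] / total[level] for level in total}


if __name__ == "__main__":
    # Toy data with 11 CoT calls per statement (values are made up).
    toy_samples = [[True] * 11, [True] * 6 + [False] * 5, [False] * 11]
    toy_gold = [True, False, True]
    print(accuracy_by_level(toy_samples, toy_gold))
```

With 11 calls, the level ranges from 6 (a bare majority) to 11 (perfect consistency); the right-hand panel of Figure 2 corresponds to looking only at the top level while varying the number of sampled CoTs.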