Efficiency and Effectiveness of LLM-Based Summarization of Evidence in Crowdsourced Fact-Checking

Kevin Roitero, Dustin Wright, Michael Soprano, Isabelle Augenstein, Stefano Mizzaro

TL;DR

The paper studies scalable crowdsourced truthfulness assessment under two evidence formats: full-length webpages (Standard) and LLM-generated summaries (Summary). In an A/B experiment on PolitiFact statements, Summary achieves accuracy and error metrics comparable to Standard while significantly increasing throughput and reducing costs; workers in the Summary condition also show higher internal agreement. The findings indicate that evidence summarization preserves essential content and can accelerate large-scale fact-checking without sacrificing quality, though the potential loss of nuance needs careful handling. The work has practical implications for deploying cost-efficient, scalable misinformation evaluation in real-world settings and motivates further work on cross-domain evaluation and per-category tailoring of summaries.

Abstract

Evaluating the truthfulness of online content is critical for combating misinformation. This study examines the efficiency and effectiveness of crowdsourced truthfulness assessments through a comparative analysis of two approaches: one involving full-length webpages as evidence for each claim, and another using summaries for each evidence document generated with a large language model. Using an A/B testing setting, we engage a diverse pool of participants tasked with evaluating the truthfulness of statements under these conditions. Our analysis explores both the quality of assessments and the behavioral patterns of participants. The results reveal that relying on summarized evidence offers comparable accuracy and error metrics to the Standard modality while significantly improving efficiency. Workers in the Summary setting complete a significantly higher number of assessments, reducing task duration and costs. Additionally, the Summary modality maximizes internal agreement and maintains consistent reliance on and perceived usefulness of evidence, demonstrating its potential to streamline large-scale truthfulness evaluations.

Paper Structure

This paper contains 20 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Evidence summarization prompt.
  • Figure 2: Agreement between workers and experts on individual and aggregated assessments.
  • Figure 3: Krippendorff's $\alpha$ score.
  • Figure 4: Accuracy across sampling sizes.
  • Figure 5: Time elapsed on statements for the two modalities.
  • ...and 2 more figures