FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking

Kevin Nanekhan, Venktesh V, Erik Martin, Henrik Vatndal, Vinay Setty, Avishek Anand

TL;DR

The paper tackles the evidence-retrieval bottleneck in automated fact-checking over web-scale knowledge sources. It combines two techniques: (i) corpus compression, which extracts succinct factual statements via Fact Extraction and Citation Extraction (optionally fused), and (ii) index compression of dense representations using Joint Product Quantization (JPQ) with end-to-end training. This reduces storage (up to ~93% smaller index, ~14.4:1 compression) and latency (up to ~10x speedup on CPU and ~33x on GPU), enabling real-time fact-checking and live-event verification such as the 2024 presidential debate. The data and code are open-sourced, which supports use in low-resource settings and large-scale, real-time misinformation mitigation.

Abstract

Advances in digital tools have led to the rampant spread of misinformation. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. Automated fact-checking must be efficient to help combat misinformation in real time and at the source. Fact-checking pipelines primarily comprise a knowledge retrieval component, which extracts the knowledge needed to fact-check a claim from large knowledge sources like Wikipedia, and a verification component. Existing work focuses primarily on fact verification rather than on evidence retrieval from large data collections, which often faces scalability issues in practical applications such as live fact-checking. In this study, we address this gap by exploring various methods for indexing a succinct set of factual statements from large collections like Wikipedia to enhance the retrieval phase of the fact-checking pipeline. We also explore the impact of vector quantization to further improve the efficiency of pipelines that employ dense retrieval approaches for first-stage retrieval. We study the efficiency and effectiveness of the approaches on fact-checking datasets such as HoVer and WiCE, leveraging Wikipedia as the knowledge source. We also evaluate the real-world utility of the efficient retrieval approaches by fact-checking the 2024 presidential debate, and we open-source the collection of claims identified in the debate with their corresponding labels. By combining indexed facts with dense retrieval and index compression, we achieve up to a 10.0x speedup on CPUs and more than a 20.0x speedup on GPUs compared to classical fact-checking pipelines over large collections.
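
The vector quantization mentioned in the abstract can be illustrated with plain product quantization (PQ). The sketch below is not the paper's JPQ, which trains the codebooks end to end with the retriever; it is a minimal NumPy illustration under stated assumptions: random vectors stand in for document embeddings, ranking uses squared Euclidean distance, and the function names (train_codebooks, encode, search) are hypothetical.

```python
import numpy as np


def train_codebooks(vectors, num_subspaces, num_centroids, iters=10, seed=0):
    """Learn one k-means codebook per subspace (plain PQ, not jointly trained JPQ)."""
    rng = np.random.default_rng(seed)
    n, dim = vectors.shape
    assert dim % num_subspaces == 0, "dim must be divisible by num_subspaces"
    sub_dim = dim // num_subspaces
    codebooks = np.empty((num_subspaces, num_centroids, sub_dim), dtype=np.float32)
    for m in range(num_subspaces):
        sub = vectors[:, m * sub_dim:(m + 1) * sub_dim]
        # Initialise centroids from randomly chosen data points.
        centroids = sub[rng.choice(n, num_centroids, replace=False)].copy()
        for _ in range(iters):
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(1)
            for k in range(num_centroids):
                members = sub[assign == k]
                if len(members) > 0:  # keep the old centroid if a cluster is empty
                    centroids[k] = members.mean(0)
        codebooks[m] = centroids
    return codebooks


def encode(vectors, codebooks):
    """Replace each sub-vector by the index of its nearest centroid (one byte each)."""
    num_subspaces, _, sub_dim = codebooks.shape
    codes = np.empty((vectors.shape[0], num_subspaces), dtype=np.uint8)
    for m in range(num_subspaces):
        sub = vectors[:, m * sub_dim:(m + 1) * sub_dim]
        dists = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(1)
    return codes


def search(query, codes, codebooks, top_k=5):
    """Asymmetric distance: the query stays in float, documents are compact codes."""
    num_subspaces, _, sub_dim = codebooks.shape
    # table[m, k] = squared distance between query sub-vector m and centroid k
    table = np.stack([
        ((codebooks[m] - query[m * sub_dim:(m + 1) * sub_dim]) ** 2).sum(-1)
        for m in range(num_subspaces)
    ])
    # Sum the per-subspace table lookups to approximate each document's distance.
    approx = table[np.arange(num_subspaces), codes].sum(1)
    return np.argsort(approx)[:top_k]


# Stand-in data: 1000 random "document embeddings" of dimension 32 (an assumption).
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 32)).astype(np.float32)
codebooks = train_codebooks(docs, num_subspaces=8, num_centroids=16)
codes = encode(docs, codebooks)
top = search(docs[0], codes, codebooks, top_k=5)
# Raw float32 vectors versus one byte per subspace (codebook storage not counted).
ratio = docs.nbytes / codes.nbytes
```

In this toy setup each 32-dimensional float32 vector (128 bytes) is stored as 8 one-byte codes. The ratio here is a property of the chosen toy parameters, not a reproduction of the figures reported in the paper.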

Paper Structure

This paper contains 15 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of Existing and Proposed Fact-Checking Pipelines
  • Figure 2: HoVer and WiCE task performance (FW: Full-Wiki, FE: Fact Extraction, IC: Index Compression, CE: Citation Extraction, Fu: Fusion)
  • Figure 3: Retrieval performance comparison
  • Figure 4: Live fact-checking performance across different corpus setups