The Alignment Bottleneck in Decomposition-Based Claim Verification

Mahmud Elahi Akhter, Federico Ruggeri, Iman Munire Bilal, Rob Procter, Maria Liakata

TL;DR

The paper investigates how effective decomposition-based claim verification is in real-world, temporally bounded settings. It identifies two critical bottlenecks, evidence alignment at the sub-claim level and the reliability of sub-claim labels, and introduces a dataset with manually annotated sub-claim evidence spans to study them. By evaluating two evidence alignment schemes, Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE), across multiple datasets (PHEMEPlus, MMM-Fact, COVID-Fact) and supervision regimes (oracle vs. noisy sub-claim labels), the work shows that decomposition yields gains only when sub-claims are paired with granular, aligned evidence and reliable labels. Conversely, with noisy sub-claim labels or coarse evidence, decomposition can degrade performance, which highlights the need for precise evidence synthesis and calibrated abstention for robust verification.

Abstract

Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvements only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance, as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error determines downstream robustness. We find that conservative "abstention" significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.
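
To make the contrast between the two alignment setups concrete, the following is a minimal sketch of how sub-claims might be paired with evidence under SAE and SRE. It is not the authors' implementation: the `SubClaim` class, the `evidence_span` field, the `build_pairs` function, and the placeholder strings are all hypothetical names and toy data introduced for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SubClaim:
    """Hypothetical container: a sub-claim and its annotated evidence span."""
    text: str
    evidence_span: str


def build_pairs(
    claim_evidence: str, sub_claims: list[SubClaim], scheme: str
) -> list[tuple[str, str]]:
    """Pair each sub-claim with evidence under the chosen alignment scheme."""
    if scheme == "SAE":
        # Sub-claim Aligned Evidence: each sub-claim gets its own span.
        return [(sc.text, sc.evidence_span) for sc in sub_claims]
    if scheme == "SRE":
        # Repeated Claim-level Evidence: the same claim-level evidence
        # is repeated for every sub-claim.
        return [(sc.text, claim_evidence) for sc in sub_claims]
    raise ValueError(f"unknown scheme: {scheme}")


# Toy placeholders, not data from the paper.
claim_evidence = "span A ... span B ... unrelated context"
sub_claims = [
    SubClaim("sub-claim 1", "span A"),
    SubClaim("sub-claim 2", "span B"),
]

sae_pairs = build_pairs(claim_evidence, sub_claims, "SAE")
sre_pairs = build_pairs(claim_evidence, sub_claims, "SRE")
```

Under this sketch, every SRE pair carries the full claim-level evidence, so a sub-claim verifier has to locate the relevant span itself, whereas each SAE pair contains only the span annotated for that sub-claim.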

Paper Structure

This paper contains 31 sections, 7 equations, 2 figures, and 11 tables.

Figures (2)

  • Figure 1: Our annotation and claim verification pipeline and the different setups used in the study. Oracle (SAE/SRE) setups use gold sub-claim labels, ablation models use no sub-claim labels, and the noisy setup (not shown in the figure) uses predicted sub-claim labels.
  • Figure 2: Graph neural network classification pipeline.