
What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis

Delip Rao, Chris Callison-Burch

Abstract

Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
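
To make the trace-generation step concrete, the sketch below shows one way a structured reasoning trace could be elicited for a single claim-document pair with GPT-4o-mini. The prompt wording, JSON schema, and the helper name `generate_trace` are illustrative assumptions rather than the authors' exact protocol; only the model choice and the six-pattern taxonomy (see Figure 3) come from the paper. Aggregating the returned pattern tags over all examples would yield a distribution like the one reported in Figure 2.

```python
# Hedged sketch: eliciting a structured reasoning trace for one claim-document
# pair. Prompt wording, JSON schema, and function name are assumptions; the
# model name and the six-pattern taxonomy are taken from the paper.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REASONING_PATTERNS = [
    "direct_evidence_extraction",      # (A)
    "nuance_and_implication",          # (B)
    "absence_of_evidence",             # (C)
    "multi_point_synthesis",           # (D)
    "scope_and_specificity_mismatch",  # (E)
    "step_by_step_verification",       # (F)
]

def generate_trace(claim: str, document: str) -> dict:
    """Ask the model to reason before labeling, then tag the strategies used."""
    prompt = (
        "Verify the claim against the document. First write a numbered "
        "reasoning trace, then output a JSON object with keys 'label' "
        "('supported' or 'unsupported'), 'trace' (a list of steps), and "
        f"'patterns' (a subset of {REASONING_PATTERNS}).\n\n"
        f"Document:\n{document}\n\nClaim:\n{claim}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```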

Figures (4)

  • Figure 1: Surface matching fails on claims requiring reasoning. A state-of-the-art 7B verifier (MiniCheck; Tang et al., 2024) rejects the claim because 100°C does not appear in the document, even though 100°C = 212°F. A compact reasoning verifier that generates an explicit reasoning trace before deciding correctly handles the unit conversion (see the sketch after this list). This example motivates our analysis: if benchmarks primarily reward surface matching, high scores may not reflect genuine verification ability.
  • Figure 2: Distribution of reasoning patterns across 24.1K claim verification traces. Direct evidence extraction (A) dominates the verification strategies (27,988 instances), followed by other reasoning strategies. See the section on reasoning patterns for pattern definitions.
  • Figure 3: Distribution of reasoning strategies across the nine source datasets in LLMAggreFact. Each subplot (A–F) corresponds to a reasoning pattern: (A) Direct evidence extraction & matching, (B) Handling nuance & implication, (C) Absence of evidence identification, (D) Synthesis of multiple information points, (E) Addressing scope & specificity mismatches, and (F) Step-by-step verification. Different datasets elicit markedly different reasoning patterns.
  • Figure 4: Distribution of error types on (a) LLMAggreFact, (b) SciFact, and (c) a math-reasoning benchmark derived from GSM8K. Error profiles vary dramatically across domains: general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures.
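
As a companion to Figure 1, the following is a minimal sketch of the reason-then-decide setup: a small causal language model is prompted to write out its reasoning before emitting a verdict. The checkpoint name is a placeholder (the paper's 1B verifier is not identified in this excerpt), and the prompt and verdict format are assumptions.

```python
# Hedged sketch of a reason-then-decide verifier on Figure 1's example.
# The model name is a hypothetical placeholder; prompt/verdict format assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/compact-1b-reasoning-verifier"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

document = "Water boils at 212 degrees Fahrenheit at sea level."
claim = "Water boils at 100 degrees Celsius at sea level."

prompt = (
    "Document:\n" + document + "\n\nClaim:\n" + claim + "\n\n"
    "Reason step by step (converting units if needed), then answer "
    "'Verdict: supported' or 'Verdict: unsupported'.\nReasoning:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=False)
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
verdict = "supported" if "Verdict: supported" in completion else "unsupported"
print(completion, "\nFinal verdict:", verdict)
```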