DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Yukun Huang; Leonardo F. R. Ribeiro; Momchil Hardalov; Bhuwan Dhingra; Markus Dreyer; Venkatesh Saligrama

TL;DR

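This work proposes Audit-then-Score (AtS), an evolving benchmarking protocol in which labels and rationales are revisable, and instantiates it as DeepFact-Bench, a versioned deep research report (DRR) factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.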

Abstract

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.

Paper Structure (111 sections, 3 equations, 7 figures, 8 tables)

Introduction
Related Work
DRRs Evaluation
Fact-Checking
Reliability in Human Annotations
Problem Formulation
Task: Verifying Factuality in DRRs
Problem 1: Modeling (Building the Verifier).
Problem 2: Benchmarking (Evaluating the Verifier).
Failure of Static Ground Truth
Empirical Analysis: The Unreliability of Expert Verification
Methodology: The Micro-Gold Protocol
1. Unsupported Micro-Golds.
2. Supported Micro-Golds.
Usage and Validation.
...and 96 more sections

Figures (7)

Figure 1: Evolving Benchmarking via Audit-then-Score (AtS). Left: AtS workflow. Right: an example of an evolving benchmark. Unlike traditional static benchmarking, AtS treats ground truth $y_i^{(t)}$ as an evolving consensus. The process proceeds in four stages: (1) Evaluate: Run a Challenger agent ($M_t$) on the current benchmark state ($B_t$), producing a verdict $\hat{y}_i$. (2) Challenge: When $\hat{y}_i \neq y_i^{(t)}$, the Challenger submits a proposal with evidence. (3) Audit: An Auditor (human expert or trusted agent) adjudicates the dispute; if the Challenger’s argument is stronger than the incumbent rationale, the update is accepted. (4) Evolve & Score: Accepted updates yield the next benchmark state ($B_{t+1}$); the Challenger is then scored against this refined ground truth.
Figure 2: DeepFact-Eval vs. traditional fact-checkers: left, simplified VeriScore/FactCheck-GPT/SAFE; right, DeepFact-Eval workflow
Figure 3: Benchmark Accuracy Evolution on Micro-golds Across AtS Auditing Rounds with expert auditors.
Figure 4: Agent-only auditing for AtS. For each auditor $A_i$, we report its Round-0 solo accuracy and its Round-1 audited accuracy when auditing another agent $A_j$ ($A_i\!\rightarrow\!A_j$; outer bars). Inner bars within each $A_i\!\rightarrow\!A_j$ show the audited agent’s solo (Round-0) accuracy $A_j$ for reference.
Figure 5: Results of DeepFact-Eval on SciFact, ExpertQA, and Factcheck-Bench. Solid green indicates Agreement (the verifier's prediction matches the benchmark label). Hatched slices denote Disagreements (the verifier's prediction does not match the benchmark label). Green-hatched indicates Annotation divergence (e.g., evidence-label misalignment, non-verifiable or ambiguous sentences, subjective or underspecified claims, or divergence that cannot be resolved because no gold rationale is available), while red-hatched indicates Likely model error (expert re-annotation aligns with the benchmark label).
...and 2 more figures
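
The four AtS stages in the Figure 1 caption can be read as a single update loop. The following Python sketch illustrates one round under stated assumptions; it is not the authors' implementation. The names `Entry`, `audit_then_score`, the `Challenger` and `Auditor` callable types, and the string labels are all hypothetical, and the Auditor is reduced to a yes/no decision on whether the Challenger's evidence beats the incumbent rationale.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Entry:
    """One benchmark item: current consensus label y_i^(t) and its rationale."""
    label: str
    rationale: str


# Hypothetical interfaces (assumptions, not from the paper):
# a Challenger maps a claim to (verdict, evidence);
# an Auditor decides whether a proposed revision is accepted.
Benchmark = Dict[str, Entry]
Challenger = Callable[[str], Tuple[str, str]]
Auditor = Callable[[str, Entry, str, str], bool]


def audit_then_score(
    benchmark: Benchmark, challenger: Challenger, auditor: Auditor
) -> Tuple[Benchmark, float]:
    """Run one AtS round: return the next benchmark state and the score."""
    verdicts: Dict[str, str] = {}
    updated: Benchmark = dict(benchmark)  # B_{t+1} starts as a copy of B_t

    for claim, entry in benchmark.items():
        # (1) Evaluate: the Challenger produces a verdict for the claim.
        verdict, evidence = challenger(claim)
        verdicts[claim] = verdict

        # (2) Challenge: only disagreements are submitted, with evidence.
        if verdict != entry.label:
            # (3) Audit: the Auditor adjudicates the dispute.
            if auditor(claim, entry, verdict, evidence):
                updated[claim] = Entry(label=verdict, rationale=evidence)

    # (4) Evolve & Score: score against the revised labels, not the old ones.
    if not updated:
        return updated, 0.0
    correct = sum(verdicts[c] == updated[c].label for c in updated)
    return updated, correct / len(updated)
```

Scoring only after accepted revisions are applied is the point of the ordering: a Challenger that correctly overturns a wrong label is credited for it rather than penalized, while a rejected challenge still counts as an error.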
