FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents

Yilun Zhao, Yitao Long, Yuru Jiang, Chengye Wang, Weiyuan Chen, Hongjun Liu, Yiming Zhang, Xiangru Tang, Chen Zhao, Arman Cohan

TL;DR

FinDVer can serve as a valuable benchmark for evaluating LLM capabilities in claim verification over complex, expert-domain documents. Results show that even the current best-performing system (i.e., GPT-4o) significantly lags behind human experts.

Abstract

We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs in the context of understanding and analyzing long, hybrid-content financial documents. FinDVer contains 2,400 expert-annotated examples, divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning, each addressing common scenarios encountered in real-world financial contexts. We assess a broad spectrum of LLMs under long-context and RAG settings. Our results show that even the current best-performing system, GPT-4o, still lags behind human experts. We further provide in-depth analysis of the long-context and RAG settings, Chain-of-Thought reasoning, and model reasoning errors, offering insights to drive future advancements. We believe that FinDVer can serve as a valuable benchmark for evaluating LLMs in claim verification over complex, expert-domain documents.

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: An example from the numerical reasoning subset of the FinDVer benchmark. To verify the claim, the LLM is required to first locate claim-relevant data points within long and hybrid-content financial documents, and then apply numerical reasoning over the extracted data points for claim verification.
  • Figure 2: An overview of the FinDVer construction pipeline. We collect and process companies' quarterly and annual reports, which contain both tables and text, as source financial documents. For each financial document, expert annotators first annotate "entailed" claims. They then perturb these claims to introduce factual errors, turning them into "refuted" claims. For each claim, the annotators provide supporting evidence and an explanation of their reasoning process. Finally, each annotated example undergoes quality validation by a separate expert annotator. This pipeline is designed to ensure the high quality of FinDVer.
  • Figure 3: Comparison of LLM performance on the testmini split in long-context versus RAG settings using the CoT prompting method.
  • Figure 4: An example from the FinDVer testmini set.
  • Figure 5: The Chain-of-Thought prompt used.
  • ...and 1 more figure