DocFinQA: A Long-Context Financial Reasoning Dataset

Varshini Reddy; Rik Koncel-Kedziorski; Viet Dac Lai; Michael Krumdick; Charles Lovering; Chris Tanner

TL;DR

DocFinQA tackles the need for realistic long-context financial QA by extending FinQA to full SEC filings, creating a dataset with questions tied to documents on the order of $10^5$ tokens and accompanied by executable Python programs. The authors evaluate retrieval-based pipelines and long-context LLMs, showing that dense retrievers fine-tuned on DocFinQA outperform baselines, but many questions remain unanswerable for 100K+ token documents. A case study on the longest documents demonstrates that retrieval-assisted GPT-4 can approach or match non-expert performance, but still lags human experts in many cases, highlighting the challenge of cross-chunk reasoning and context disambiguation. Overall, DocFinQA provides a more realistic benchmark for financial numerical reasoning and long-form document understanding, with potential impact on other domains requiring long-range context.

Abstract

For large language models (LLMs) to be effective in the financial domain -- where each decision can have a significant impact -- it is necessary to investigate realistic tasks and data. Financial professionals often interact with documents that are hundreds of pages long, but most financial research datasets only deal with short excerpts from these documents. To address this, we introduce a long-document financial QA task. We augment 7,437 questions from the existing FinQA dataset with the full-document context, extending the average context length from under 700 words in FinQA to 123k words in DocFinQA. We conduct extensive experiments over retrieval-based QA pipelines and long-context language models. DocFinQA proves a significant challenge for even state-of-the-art systems. We also provide a case study on the longest documents in DocFinQA and find that models particularly struggle on these documents. Addressing these challenges may have a wide-reaching impact across applications where specificity and long-range contexts are critical, such as gene sequences and legal document contract analysis.

Paper Structure (17 sections, 11 figures, 5 tables)

Introduction
Related Work
DocFinQA Dataset
Retrieval-based QA Evaluation
Question Answering Task
Case Study w/ 100K+ Token Documents
Conclusion
SEC Filing Collection
Parsing SEC Filings
Code Conversion
Model Details
Distribution of question by question types
Performance of retrieval methods
ColBERT Finetuning
Few-shot Settings
...and 2 more sections

Figures (11)

Figure 1: DocFinQA extends FinQA to documents often over 150 pages long (100K+ tokens), so it is difficult to find the pertinent information. The question for the example above is: "For the quarter December 31, 2012 what was the percent of the total number of shares purchased in December?" The correct answer is 16.5%.
Figure 2: Histogram of document length (#words) in the DocFinQA dataset, with the dashed line representing the average document length. The purple line depicts the proportion of documents where the question context is within the current number of words.
Figure 3: Hit rate of retrieval models.
Figure 4: Accuracy for varying HR@ for two context extraction methods.
Figure 5: Example of code conversion. (a) Original FinQA's derivation. (b) Dummy Python Program. (c) Meaningful Python Code in DocFinQA.
...and 6 more figures
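
Figures 3 and 4 report retrieval quality as a hit rate (HR@k). This page does not give the paper's exact definition, so the sketch below assumes the common one: the fraction of questions whose gold context chunk appears among the top-k retrieved chunks. The function name `hit_rate_at_k`, the single-gold-chunk assumption, and the toy chunk ids are all hypothetical and used only for illustration.

```python
from typing import Sequence


def hit_rate_at_k(
    ranked_chunk_ids: Sequence[Sequence[int]],
    gold_chunk_ids: Sequence[int],
    k: int,
) -> float:
    """Fraction of questions whose gold chunk is in the top-k retrieved chunks.

    Assumes exactly one gold chunk per question (an illustrative simplification).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(ranked_chunk_ids) != len(gold_chunk_ids):
        raise ValueError("need one ranking and one gold chunk per question")
    if not gold_chunk_ids:
        return 0.0
    # Count a hit when the gold chunk id occurs in the first k ranked ids.
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_chunk_ids, gold_chunk_ids)
    )
    return hits / len(gold_chunk_ids)


# Toy data: three questions, each with a ranked list of retrieved chunk ids.
rankings = [[4, 2, 9], [1, 7, 3], [5, 0, 8]]
gold = [2, 3, 6]  # hypothetical gold chunk id per question

for k in (1, 2, 3):
    print(f"HR@{k} = {hit_rate_at_k(rankings, gold, k):.3f}")
```

In this toy setup the third question's gold chunk is never retrieved, so the hit rate cannot reach 1.0 at any k; in a retrieval-based QA pipeline, questions like that cannot be answered from the retrieved context, whatever model reads it.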
