Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents

Jaeyoung Choe, Jihoon Kim, Woohwan Jung

TL;DR

This work tackles the difficulty of open-domain QA over standardized financial documents, where boilerplate and near-identical tables hinder accurate retrieval. It introduces HiREC, a hierarchical retrieval with evidence curation framework that first narrows to related documents and then selects pertinent passages, followed by filtering, sufficiency checking, and complementary question generation to fill evidence gaps. The LOFin benchmark, comprising about 145k SEC filings and 1,595 QA pairs, provides a realistic testbed for multi-document and multi-hop reasoning in finance. Empirical results show that HiREC outperforms strong baselines and commercial web-search–based systems in retrieval quality and answer accuracy, while also improving efficiency through targeted evidence usage. The work makes LOFin publicly available and demonstrates that a purely LLM-driven, iterative curation pipeline can achieve robust, cost-effective financial QA without heavy training on specialized data.

Abstract

Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to confuse near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach performs hierarchical retrieval to reduce confusion among similar texts: it first retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process then removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
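
The two-stage retrieval and the curation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap score stands in for a real retriever and reranker, and the callbacks `is_relevant` and `next_query` are hypothetical stand-ins for the LLM-based passage filter and for the sufficiency check with complementary query generation. The corpus, document ids, and figures are invented toy data.

```python
from typing import Callable, Dict, List, Optional, Tuple

Corpus = Dict[str, List[str]]   # document id -> list of passages
Evidence = Tuple[str, str]      # (document id, passage)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query tokens found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(query: str, corpus: Corpus,
                          top_docs: int, top_passages: int) -> List[Evidence]:
    """Stage 1 ranks whole documents; stage 2 ranks passages of the top documents only."""
    doc_scores = {d: overlap(query, " ".join(ps)) for d, ps in corpus.items()}
    best_docs = sorted(corpus, key=lambda d: doc_scores[d], reverse=True)[:top_docs]
    candidates = [(d, p) for d in best_docs for p in corpus[d]]
    candidates.sort(key=lambda dp: overlap(query, dp[1]), reverse=True)
    return candidates[:top_passages]


def curated_retrieval(question: str, corpus: Corpus,
                      is_relevant: Callable[[str, str], bool],
                      next_query: Callable[[str, List[Evidence]], Optional[str]],
                      max_iters: int = 3, top_docs: int = 1,
                      top_passages: int = 2) -> List[Evidence]:
    """Retrieve, filter, then ask for a complementary query until evidence suffices."""
    evidence: List[Evidence] = []
    query: Optional[str] = question
    for _ in range(max_iters):
        for item in hierarchical_retrieve(query, corpus, top_docs, top_passages):
            # Filtering step: keep only passages judged relevant to the question.
            if item not in evidence and is_relevant(question, item[1]):
                evidence.append(item)
        # Sufficiency check: None means "enough evidence", otherwise a follow-up query.
        query = next_query(question, evidence)
        if query is None:
            break
    return evidence


# Toy data (assumed): near-identical filings that differ only in year and figures.
corpus: Corpus = {
    "ACME_10K_2023": ["ACME total revenue 2023 was 120 million",
                      "ACME risk factors boilerplate"],
    "ACME_10K_2022": ["ACME total revenue 2022 was 100 million",
                      "ACME risk factors boilerplate"],
    "BETA_10K_2023": ["BETA total revenue 2023 was 50 million"],
}


def is_relevant(question: str, passage: str) -> bool:
    # Stand-in for an LLM relevance judgment.
    return "revenue" in passage


def next_query(question: str, evidence: List[Evidence]) -> Optional[str]:
    # Stand-in for an LLM sufficiency check: request whichever year is still missing.
    for year in ("2022", "2023"):
        if not any(year in passage for _, passage in evidence):
            return f"ACME total revenue {year}"
    return None


evidence = curated_retrieval("ACME revenue 2023 versus 2022", corpus,
                             is_relevant, next_query)
```

In this toy run the first pass only reaches the 2023 filing, so the sufficiency stand-in issues a complementary query for 2022, and the boilerplate passages are dropped by the filter. Restricting passage ranking to the top-ranked documents is what keeps near-identical passages from other filings out of the candidate pool.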

Paper Structure

This paper contains 59 sections, 1 equation, 5 figures, 24 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of a naive RAG approach and HiREC.
  • Figure 2: Overview of the hierarchical retrieval with evidence curation framework.
  • Figure 3: Comparison of company, document, and page error rates for HiREC and baselines.
  • Figure 4: Recall, precision, and passages per query by iteration. EC stands for evidence curation.
  • Figure 5: Precision-recall curve (recall on X-axis, precision on Y-axis).