Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Amine Kobeissi, Philippe Langlais

TL;DR

The paper targets high-stakes financial QA by dissecting retrieval failures in retrieval-augmented generation when long regulatory filings are involved. It introduces an oracle-based retrieval-gap analysis that separates document discovery from within-document retrieval (page and chunk) and evaluates multiple retrieval strategies on FinanceBench. A domain-tuned page scorer is proposed to rank pages before chunk retrieval, demonstrating substantial gains in page recall and chunk retrieval, and improving downstream generation metrics. The findings highlight a remaining gap in page- and chunk-level retrieval even with correct document discovery, with practical implications for building reliable, verifiable financial QA systems.

Abstract

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high-stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity (document, page, and chunk) and introduce an oracle-based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150-question subset of FinanceBench, we reproduce and compare diverse retrieval strategies, including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page- and chunk-level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
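To make the failure mode described in the abstract concrete, the following minimal sketch separates document-discovery misses from within-document misses for a top-k result list. It is not the authors' evaluation code: the `Hit` type, the `decompose` and `summarize` helpers, the single-gold-page simplification, and the toy data are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A retrieved (or gold) location: a document id plus a page number."""
    doc_id: str
    page: int


def decompose(gold, retrieved, k=5):
    """Classify one question's top-k retrieval outcome.

    Assumes a single gold page per question (a simplification).
    """
    top_k = retrieved[:k]
    if any(h == gold for h in top_k):
        return "page_hit"
    if any(h.doc_id == gold.doc_id for h in top_k):
        # Correct document found, but the answer page was missed.
        return "within_doc_miss"
    return "doc_miss"


def summarize(examples, k=5):
    """Aggregate document recall, page recall, and the within-document gap."""
    counts = {"page_hit": 0, "within_doc_miss": 0, "doc_miss": 0}
    for gold, retrieved in examples:
        counts[decompose(gold, retrieved, k)] += 1
    n = len(examples)
    doc_hits = counts["page_hit"] + counts["within_doc_miss"]
    return {
        "doc_recall": doc_hits / n if n else 0.0,
        "page_recall": counts["page_hit"] / n if n else 0.0,
        # Page recall restricted to questions whose document was found.
        "page_recall_given_doc": counts["page_hit"] / doc_hits if doc_hits else 0.0,
        "counts": counts,
    }


# Toy data with hypothetical document ids, for illustration only.
examples = [
    (Hit("A_10K", 12), [Hit("A_10K", 12), Hit("B_10K", 3)]),
    (Hit("B_10K", 40), [Hit("B_10K", 41), Hit("B_10K", 40)]),
    (Hit("C_10K", 7), [Hit("C_10K", 8), Hit("C_10K", 9)]),
    (Hit("D_10K", 5), [Hit("A_10K", 1), Hit("B_10K", 2)]),
]
report = summarize(examples, k=5)
print(report)
```

In this toy run, the third question is the case the abstract highlights: the right filing is retrieved, yet none of the returned pages contains the answer, so it counts toward document recall but not page recall.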

Paper Structure

This paper contains 26 sections, 7 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Maximum BLEU and ROUGE-L scores between retrieved chunks and gold chunks under standard retrieval and oracle settings.
  • Figure 2: Overview of the RAG pipeline. Documents are decomposed into pages and chunks. For a query, blue pages and chunks represent the gold context containing the answer.
  • Figure 3: Document and page recall at $k=5$ by question type (50 questions per type).
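
As a rough illustration of the quantity named in the Figure 1 caption, the sketch below computes the maximum ROUGE-L score between a set of retrieved chunks and a gold chunk. The paper's exact ROUGE configuration is not given here, so the whitespace tokenization, lowercasing, F1 variant, and the helper names `lcs_length`, `rouge_l_f1`, and `max_rouge_l` are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased, whitespace-split tokens (an assumed setup)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_rouge_l(retrieved_chunks, gold_chunk):
    """Best ROUGE-L F1 achieved by any retrieved chunk against the gold chunk."""
    return max((rouge_l_f1(c, gold_chunk) for c in retrieved_chunks), default=0.0)


# Made-up strings for illustration; not taken from FinanceBench.
gold = "net revenue was 10 million in 2022"
retrieved = ["the company reported several risks", "net revenue was 12 million"]
score = max_rouge_l(retrieved, gold)
print(round(score, 4))
```

A maximum well below 1.0 suggests that no retrieved chunk closely matches the gold chunk, which is one way to read the gap between standard and oracle retrieval in such a plot.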