Financial Report Chunking for Effective Retrieval Augmented Generation

Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, Renyu Li

TL;DR

The paper tackles the challenge of long financial reports in Retrieval Augmented Generation by introducing element-type based chunking that leverages document structure (headings, tables, etc.) annotated via document understanding models. It evaluates this approach in a full RAG pipeline (Weaviate vector store, GPT-4 generation) using FinanceBench, contrasting with baseline token-based chunking and variations. Results show that structure-aware chunking delivers superior retrieval and QA performance, reduces the total number of chunks, and operates without chunk-size tuning, indicating strong generalizability. The work suggests that incorporating document structure into chunking can significantly enhance RAG effectiveness for financial documents and potentially other domains.

Abstract

Chunking information is a key step in Retrieval Augmented Generation (RAG). Current research primarily centers on paragraph-level chunking. This approach treats all texts as equal and neglects the information contained in the structure of documents. We propose an expanded approach that moves beyond mere paragraph-level chunking to chunk documents primarily by their structural elements. Dissecting documents into these constituent elements creates a new way to chunk documents that yields the best chunk size without tuning. We introduce a novel framework that evaluates how chunking based on element types annotated by document understanding models contributes to the overall context and accuracy of the information retrieved. We also demonstrate how this approach impacts RAG-assisted Question & Answer task performance. Our research includes a comprehensive analysis of various element types, their role in effective information retrieval, and the impact they have on the quality of RAG outputs. Findings support that element-type based chunking largely improves RAG results on financial reporting. Through this research, we are also able to answer how to uncover highly accurate RAG.
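
To make the idea of chunking along structural elements more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Element` class, the labels `"Title"` and `"Table"`, the boundary rules (a heading opens a new chunk, a table stands alone) and the `max_chars` safety cap are all assumptions chosen for illustration. The input is assumed to be a list of elements already annotated by some document understanding model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """One annotated document element (hypothetical structure)."""
    kind: str  # assumed labels, e.g. "Title", "NarrativeText", "Table"
    text: str


def chunk_by_elements(elements: List[Element], max_chars: int = 2048) -> List[str]:
    """Group consecutive elements into chunks along structural boundaries.

    Assumed rules (illustrative only):
      * a "Title" element starts a new chunk,
      * a "Table" element is kept whole as its own chunk,
      * max_chars is a safety cap on merged text, not a tuned chunk size.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0

    def flush() -> None:
        # Emit the chunk under construction, if any, and reset the buffer.
        nonlocal current, size
        if current:
            chunks.append("\n".join(current))
        current, size = [], 0

    for el in elements:
        if el.kind == "Table":
            flush()
            chunks.append(el.text)
            continue
        if el.kind == "Title" or size + len(el.text) > max_chars:
            flush()
        current.append(el.text)
        size += len(el.text)

    flush()
    return chunks


# Usage with made-up elements from a hypothetical report.
example = [
    Element("Title", "Item 7"),
    Element("NarrativeText", "Revenue grew."),
    Element("Table", "Year | Revenue"),
    Element("Title", "Item 8"),
    Element("NarrativeText", "Notes."),
]
example_chunks = chunk_by_elements(example)
```

In this sketch the chunk boundaries come from the document structure itself, so the heading stays attached to the text it introduces and the table is never split, whereas a fixed token window could cut through either.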

Paper Structure (14 sections, 5 figures, 5 tables)

This paper contains 14 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: RAG steps to answer a question about a document
  • Figure 2: Indexing of document chunks into the vector database
  • Figure 3: Example prompt template used by the generator
  • Figure 4: Evaluation prompt template. The {question}, {ground_truth_answer} and {generated_answer} fields are substituted for each question accordingly.
  • Figure 5: Evaluation prompt template