Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs

Chongjian Yue, Xinrun Xu, Xiaojun Ma, Lun Du, Hengyu Liu, Zhiming Ding, Yanbing Jiang, Shi Han, Dongmei Zhang

TL;DR

This research proposes an Automated Financial Information Extraction (AFIE) framework that enhances LLMs' ability to comprehend and extract information from financial reports. The results suggest that AFIE improves the accuracy of automated numerical extraction from complex, hybrid documents.

Abstract

Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains underexplored. In this research, we focus on harnessing the potential of LLMs to comprehend critical information from financial reports, which are hybrid long documents. We propose an Automated Financial Information Extraction (AFIE) framework that enhances LLMs' ability to comprehend and extract information from financial reports. To evaluate AFIE, we develop a Financial Reports Numerical Extraction (FINE) dataset and conduct an extensive experimental analysis. We validate the framework on GPT-3.5 and GPT-4, where it yields average accuracy increases of 53.94% and 33.77%, respectively, over a naive method. These results suggest that the AFIE framework improves the accuracy of automated numerical extraction from complex, hybrid documents.

Paper Structure

This paper contains 38 sections, 1 equation, 9 figures, and 15 tables.

Figures (9)

  • Figure 1: The AIE framework illustrates the end-to-end IE process, consisting of four modules: Segmentation, dividing lengthy documents into short segments; Retrieval, selecting the most relevant segments related to the given keyword; Summarization, using LLMs to generate a concise summary of relevant information; and Extraction, extracting the keyword-corresponding value from the summary. This framework is exemplified using financial reports.
  • Figure 2: Comparison of the Naive method and AIE at different RETA levels on FINE.
  • Figure 3: Comparison of the Naive method and AIE on WIKIR and MPP.
  • Figure 4: Comparison of the Naive method and AIE at different RETA levels on GPT-4.
  • Figure 5: Exploring the Capability to Handle Keyword Ambiguity: Comparison of Naive and AIE on RPD
  • ...and 4 more figures
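
The four modules named in the Figure 1 caption (Segmentation, Retrieval, Summarization, Extraction) can be sketched as a small pipeline. This is an illustrative sketch only, not the authors' implementation: the function names, the paragraph-packing segmenter, the keyword-count retrieval score, the prompts, and the `toy_llm` stub are all assumptions made here. A real setup would pass a function that calls an actual LLM in place of the stub.

```python
import re
from typing import Callable, List, Optional

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def segment(document: str, max_chars: int = 200) -> List[str]:
    """Segmentation: pack blank-line-separated paragraphs into short segments.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    segments, current = [], ""
    for para in (p.strip() for p in document.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            segments.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        segments.append(current)
    return segments


def retrieve(segments: List[str], keyword: str, top_k: int = 2) -> List[str]:
    """Retrieval: keep the top_k segments by keyword-term count.

    Lexical counting is an assumed stand-in for whatever relevance
    measure the paper actually uses.
    """
    terms = keyword.lower().split()

    def score(seg: str) -> int:
        text = seg.lower()
        return sum(text.count(t) for t in terms)

    ranked = sorted(segments, key=score, reverse=True)
    return [s for s in ranked[:top_k] if score(s) > 0]


def summarize(segments: List[str], keyword: str, llm: LLM) -> str:
    """Summarization: ask the LLM for a concise summary about the keyword."""
    body = "\n---\n".join(segments)
    return llm(f"Summarize what the text says about '{keyword}':\n{body}")


def extract(summary: str, keyword: str, llm: LLM) -> str:
    """Extraction: ask the LLM for the keyword's value from the summary."""
    return llm(f"Return only the value of '{keyword}':\n{summary}").strip()


def run_pipeline(document: str, keyword: str, llm: LLM,
                 max_chars: int = 200, top_k: int = 2) -> Optional[str]:
    """Chain the four modules; return None if no segment mentions the keyword."""
    relevant = retrieve(segment(document, max_chars), keyword, top_k)
    if not relevant:
        return None
    return extract(summarize(relevant, keyword, llm), keyword, llm)


def toy_llm(prompt: str) -> str:
    """Stub standing in for a real LLM: echoes the first dollar amount it sees."""
    match = re.search(r"\$[\d,.]+ million", prompt)
    return match.group(0) if match else ""
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model API, so the same pipeline can be exercised with a stub in tests and with a real model in practice.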