Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs

Chongjian Yue; Xinrun Xu; Xiaojun Ma; Lun Du; Hengyu Liu; Zhiming Ding; Yanbing Jiang; Shi Han; Dongmei Zhang

TL;DR

This research proposes an Automated Financial Information Extraction (AFIE) framework that enhances LLMs' ability to comprehend and extract information from financial reports. On GPT-3.5 and GPT-4, it raises average accuracy by 53.94% and 33.77%, respectively, over a naive method, suggesting that AFIE improves the accuracy of automated numerical extraction from complex, hybrid documents.

Abstract

Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains underexplored. In this research, we focus on harnessing the potential of LLMs to comprehend critical information from financial reports, which are hybrid long documents. We propose an Automated Financial Information Extraction (AFIE) framework that enhances LLMs' ability to comprehend and extract information from financial reports. To evaluate AFIE, we develop a Financial Reports Numerical Extraction (FINE) dataset and conduct an extensive experimental analysis. We validate the framework on GPT-3.5 and GPT-4, where it yields average accuracy increases of 53.94% and 33.77%, respectively, compared to a naive method. These results suggest that the AFIE framework improves the accuracy of automated numerical extraction from complex, hybrid documents.

Paper Structure (38 sections, 1 equation, 9 figures, 15 tables)

Introduction
Prepared Work
Automated Information Extraction
Segmentation
Retrieval
Summarization
Extraction
Prompt Engineering
Dataset
Datasets on Three Domains
Evaluation Metrics
Overall Performance on Three Domains
Adaptability for LLMs with Different Capabilities
Capability to Handle Ambiguity
Analysis of Table Serialization Formats
...and 23 more sections

Figures (9)

Figure 1: The AIE framework illustrates the end-to-end IE process, consisting of four modules: Segmentation, dividing lengthy documents into short segments; Retrieval, selecting the most relevant segments related to the given keyword; Summarization, using LLMs to generate a concise summary of relevant information; and Extraction, extracting the keyword-corresponding value from the summary. This framework is exemplified using financial reports.
Figure 2: Comparison of the Naive method and AIE at different RETA levels on FINE.
Figure 3: Comparison of the Naive method and AIE on WIKIR and MPP.
Figure 4: Comparison of the Naive method and AIE at different RETA levels on GPT-4.
Figure 5: Exploring the Capability to Handle Keyword Ambiguity: Comparison of Naive and AIE on RPD.
...and 4 more figures
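
The four modules named in the Figure 1 caption (Segmentation, Retrieval, Summarization, Extraction) can be pictured as a simple chain of functions. The following is only a minimal sketch under stated assumptions, not the authors' implementation: all function names and parameters are hypothetical, the retriever uses naive keyword overlap as a stand-in for whatever retrieval method the paper uses, the prompt wording is invented for illustration, and `llm` is any callable that maps a prompt string to a response string.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # assumption: any function mapping a prompt to a reply


def segment(document: str, max_chars: int = 200) -> List[str]:
    """Segmentation: split on blank lines, then cap each piece at max_chars."""
    segments = []
    for block in re.split(r"\n\s*\n", document):
        block = block.strip()
        while block:
            segments.append(block[:max_chars])
            block = block[max_chars:].strip()
    return segments


def retrieve(segments: List[str], keyword: str, top_k: int = 2) -> List[str]:
    """Retrieval: keep the top_k segments sharing words with the keyword.

    Word overlap is a placeholder scoring function chosen for illustration.
    """
    terms = set(keyword.lower().split())

    def score(seg: str) -> int:
        words = re.findall(r"[a-z0-9]+", seg.lower())
        return sum(1 for w in words if w in terms)

    ranked = sorted(segments, key=score, reverse=True)
    return [s for s in ranked[:top_k] if score(s) > 0]


def summarize(segments: List[str], keyword: str, llm: LLM) -> str:
    """Summarization: ask the LLM for a concise summary about the keyword."""
    prompt = f"Summarize the information about '{keyword}':\n" + "\n".join(segments)
    return llm(prompt)


def extract(summary: str, keyword: str, llm: LLM) -> str:
    """Extraction: ask the LLM for the keyword's value, given the summary."""
    prompt = f"From the summary below, return only the value of '{keyword}':\n{summary}"
    return llm(prompt).strip()


def run_pipeline(document: str, keyword: str, llm: LLM) -> str:
    """Chain the four modules end to end."""
    relevant = retrieve(segment(document), keyword)
    return extract(summarize(relevant, keyword, llm), keyword, llm)
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model or client library, so each module can be exercised with a stub before a real model is attached.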
