Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset

Chongjian Yue; Xinrun Xu; Xiaojun Ma; Lun Du; Zhiming Ding; Shi Han; Dongmei Zhang; Qi Zhang

TL;DR

The paper tackles information extraction from Hybrid Long Documents (HLDs) that combine text and tables beyond typical LLM token limits. It introduces the Automated Information Extraction (AIE) framework, a four-module pipeline (Segmentation, Retrieval, Summarization, Extraction) augmented by prompt engineering and a simple table serialization approach, and evaluates it on three datasets (FINE, WIKIR, MPP) using the $RETA$ metric for FINE. The study analyzes segmentation strategies (Refine vs Map-Reduce), table representations, retrieval settings, and prompt designs, demonstrating that AIE consistently outperforms naive LLM baselines and generalizes across domains and even powerful models like GPT-4. The work contributes a new financial data–focused dataset, FINE, and a publicly available codebase to advance robust HLD information extraction in real-world settings.

Abstract

Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains unexplored. Hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction (AIE) framework to enable LLMs to process HLDs, and we carry out experiments to analyze four important aspects of information extraction from HLDs. Our findings are: 1) an effective way to select and summarize the useful parts of an HLD; 2) a simple table serialization method is sufficient for LLMs to understand tables; 3) the naive AIE framework adapts to many complex scenarios; 4) useful prompt engineering techniques that enhance LLM performance on HLDs. To address the scarcity of HLD datasets and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
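The four modules named above (Segmentation, Retrieval, Summarization, Extraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the word-overlap retrieval, the prompt wording, and the rule-based `toy_llm` stand-in are all assumptions made so the example runs without a real model.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def segment(document: str, max_chars: int = 100) -> List[str]:
    """Segmentation: pack blank-line-separated blocks into chunks of limited size."""
    segments, current = [], ""
    for block in document.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if current and len(current) + len(block) + 2 > max_chars:
            segments.append(current)
            current = block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        segments.append(current)
    return segments


def retrieve(segments: List[str], keyword: str, top_k: int = 2) -> List[str]:
    """Retrieval: rank segments by word overlap with the keyword (assumed scoring)."""
    query = set(re.findall(r"\w+", keyword.lower()))

    def score(seg: str) -> int:
        return len(query & set(re.findall(r"\w+", seg.lower())))

    ranked = sorted(segments, key=score, reverse=True)
    return [s for s in ranked[:top_k] if score(s) > 0]


def summarize(segments: List[str], keyword: str, llm: LLM) -> str:
    """Summarization: ask the model to condense the retrieved segments."""
    prompt = f"Summarize the passages relevant to '{keyword}':\n" + "\n---\n".join(segments)
    return llm(prompt)


def extract(summary: str, keyword: str, llm: LLM) -> str:
    """Extraction: ask the model for the value that corresponds to the keyword."""
    return llm(f"From the summary below, return only the value of '{keyword}':\n{summary}")


def run_aie(document: str, keyword: str, llm: LLM, max_chars: int = 100, top_k: int = 2) -> str:
    """Chain the four modules end to end."""
    relevant = retrieve(segment(document, max_chars), keyword, top_k)
    return extract(summarize(relevant, keyword, llm), keyword, llm)


def toy_llm(prompt: str) -> str:
    """Rule-based stand-in for an LLM, used only to make the sketch runnable."""
    header, _, body = prompt.partition("\n")
    keyword = re.search(r"'(.+?)'", header).group(1)
    lines = [ln for ln in body.splitlines() if keyword.lower() in ln.lower()]
    if header.startswith("From the summary"):
        match = re.search(r"-?\d[\d,]*(?:\.\d+)?", " ".join(lines))
        return match.group(0) if match else ""
    return "\n".join(lines)


# Invented toy document: two text blocks around a table serialized with " | ".
doc = (
    "Item 1. Business overview. The company sells widgets worldwide.\n\n"
    "Item | 2023 | 2022\n"
    "Total revenue | 1,250 | 1,100\n"
    "Net income | 210 | 180\n\n"
    "Item 9. Legal proceedings. None are material."
)

if __name__ == "__main__":
    print(run_aie(doc, "Total revenue", toy_llm))
```

In practice `toy_llm` would be replaced by a call to an actual model, and the overlap score by whatever retrieval method the pipeline uses; the sketch only shows how the stages hand data to one another.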

Paper Structure (6 sections, 1 equation, 5 figures, 7 tables)

Introduction
AIE Framework
Dataset and Evaluation Metrics
Experiment
Related Work
Conclusion

Figures (5)

Figure 1: The AIE framework illustrates the end-to-end IE process, consisting of four modules: Segmentation, Retrieval, Summarization, and Extraction, extracting the keyword-corresponding value from the summary.
Figure 2: Comparison of the Naive method and AIE at different RETA levels on FINE.
Figure 3: Comparison of the Naive method and AIE on WIKIR and MPP using GPT-3.5.
Figure 4: Exploring the capability to handle keyword ambiguity: comparison of the Naive method and AIE on RPD.
Figure 5: Illustration of the Map-Reduce Strategy.
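
Figure 5 refers to the Map-Reduce strategy, which the TL;DR contrasts with Refine. This page does not define either one, so the sketch below assumes the common meaning of the terms: Map-Reduce summarizes each segment independently and then summarizes the combined partial summaries, while Refine walks through the segments in order and updates a running summary. The `keep_revenue_lines` function is a hypothetical stand-in for an LLM summarizer.

```python
from typing import Callable, List

Summarizer = Callable[[str], str]  # hypothetical interface: text in, summary out


def map_reduce(segments: List[str], summarize: Summarizer) -> str:
    """Summarize each segment on its own (map), then summarize the results (reduce)."""
    partial = [summarize(seg) for seg in segments]
    return summarize("\n".join(partial))


def refine(segments: List[str], summarize: Summarizer) -> str:
    """Fold segments into a running summary, one segment at a time."""
    running = ""
    for seg in segments:
        running = summarize(f"{running}\n{seg}" if running else seg)
    return running


def keep_revenue_lines(text: str) -> str:
    """Toy summarizer: keep only the lines that mention revenue."""
    return "\n".join(ln for ln in text.splitlines() if "revenue" in ln.lower())


# Invented toy segments for illustration.
segments = [
    "Revenue rose in 2023.\nThe CEO thanked staff.",
    "Weather was mild.",
    "Total revenue | 1,250\nNet income | 210",
]

if __name__ == "__main__":
    print(map_reduce(segments, keep_revenue_lines))
    print(refine(segments, keep_revenue_lines))
```

With this toy summarizer both strategies return the same text. The practical difference under these assumptions is that the map calls are independent and can run in parallel, whereas each Refine call depends on the previous one.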
