Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset
Chongjian Yue, Xinrun Xu, Xiaojun Ma, Lun Du, Zhiming Ding, Shi Han, Dongmei Zhang, Qi Zhang
TL;DR
The paper tackles information extraction from Hybrid Long Documents (HLDs), which combine text and tables and exceed typical LLM token limits. It introduces the Automated Information Extraction (AIE) framework, a four-module pipeline (Segmentation, Retrieval, Summarization, Extraction) augmented by prompt engineering and a simple table serialization approach, and evaluates it on three datasets (FINE, WIKIR, MPP), using the $RETA$ metric for FINE. The study analyzes segmentation strategies, summarization strategies (Refine vs. Map-Reduce), table representations, retrieval settings, and prompt designs, and reports that AIE consistently outperforms naive LLM baselines and generalizes across domains and to more powerful models such as GPT-4. The work contributes a new dataset focused on financial data, FINE, and a publicly available codebase to advance robust HLD information extraction in real-world settings.
Abstract
Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, which contains both textual and tabular data, remains unexplored. Hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction (AIE) framework to enable LLMs to process HLDs, and we carry out experiments to analyze four important aspects of information extraction from HLDs. Our findings are: 1) an effective way to select and summarize the useful parts of an HLD; 2) a simple table serialization is sufficient for LLMs to understand tables; 3) the naive AIE framework adapts to many complex scenarios; 4) useful prompt engineering techniques that enhance LLM performance on HLDs. To address the scarcity of HLD datasets and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
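To make the four-module structure concrete, the following is a minimal sketch of a Segmentation, Retrieval, Summarization, Extraction pipeline with a plain-text table serialization. It is an illustration under stated assumptions, not the released AIE code: every function name, the word-count segmentation, the keyword-overlap retrieval score, the prompt wording, and the `llm` callable (any function mapping a prompt string to a response string) are hypothetical stand-ins.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # assumption: any prompt -> response function


def serialize_table(rows: List[List[str]]) -> str:
    """Flatten a table to plain text: one row per line, cells joined by ' | '."""
    return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


def segment(document: str, max_words: int) -> List[str]:
    """Split the document into consecutive chunks of at most max_words words."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def retrieve(segments: List[str], keyword: str, top_k: int) -> List[str]:
    """Rank segments by word overlap with the keyword (a stand-in for a real
    retriever, which this sketch does not attempt to reproduce)."""
    terms = set(keyword.lower().split())
    ranked = sorted(
        segments,
        key=lambda s: len(terms & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def summarize_refine(segments: List[str], keyword: str, llm: LLM) -> str:
    """Refine-style summarization: update a running summary one segment at a time."""
    summary = ""
    for seg in segments:
        prompt = (
            f"Keyword: {keyword}\n"
            f"Current summary: {summary}\n"
            f"New segment: {seg}\n"
            "Update the summary with information relevant to the keyword."
        )
        summary = llm(prompt)
    return summary


def extract(summary: str, keyword: str, llm: LLM) -> str:
    """Ask the model for the final value given the condensed summary."""
    return llm(f"Using the summary below, report the value of '{keyword}'.\n{summary}")


def aie(document: str, keyword: str, llm: LLM, max_words: int = 200, top_k: int = 3) -> str:
    """Chain the four modules: segment, retrieve, summarize, extract."""
    chunks = segment(document, max_words)
    relevant = retrieve(chunks, keyword, top_k)
    summary = summarize_refine(relevant, keyword, llm)
    return extract(summary, keyword, llm)
```

In this sketch, tables would be passed through `serialize_table` before segmentation so that the whole document is handled as plain text. A Map-Reduce variant would instead summarize each retrieved segment independently and then merge the partial summaries in one further call.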
