LMDX: Language Model-based Document Information Extraction and Localization

Vincent Perot, Kai Kang, Florian Luisier, Guolong Su, Xiaoyu Sun, Ramya Sree Boppana, Zilong Wang, Zifeng Wang, Jiaqi Mu, Hao Zhang, Chen-Yu Lee, Nan Hua

TL;DR

LMDX reframes document information extraction for large language models by introducing coordinate-based layout encoding and a groundable decoding pipeline. It delivers zero-shot and data-efficient extraction of leaf and hierarchical entities with precise localization in visually rich documents, demonstrated on the VRDU and CORD benchmarks using PaLM 2-S and Gemini Pro. The four-stage methodology (chunking, prompt generation, LLM inference, and decoding), coupled with grounding guarantees and majority-vote merging, yields state-of-the-art results and robust handling of hierarchical structures. The approach reduces annotation costs and enables reliable human-in-the-loop auditing, with strong data efficiency and localization performance across diverse templates and document types.

Abstract

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art and exhibiting emergent capabilities across various tasks. However, their application to extracting information from visually rich documents, which is at the core of many document processing workflows and involves extracting key entities from semi-structured documents, has not yet been successful. The main obstacles to adopting LLMs for this task include the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism to localize the predicted entities within the document. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to reframe the document information extraction task for an LLM. LMDX enables extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. Finally, we apply LMDX to the PaLM 2-S and Gemini Pro LLMs and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.

Paper Structure (20 sections, 3 figures, 6 tables, 3 algorithms)

Figures (3)

  • Figure 1: Overview of the LMDX methodology, which decomposes the information extraction and localization task into four stages in order to frame it for an LLM. From the document, we generate LLM prompts containing both the text content and coordinate tokens (shown in blue), which communicate the layout modality (needed for high-quality extraction) and act as unique identifiers of the text segments. The prompts also contain the target schema, enabling zero-shot information extraction. The LLM completions, in JSON format, naturally support hierarchical entity extraction (e.g. line_item) and include both entity values and segment identifiers. Our decoding algorithm uses these identifiers to localize entities (i.e. to compute entity bounding boxes) and to remove LLM hallucinations.
  • Figure 2: Structure of the LLM prompts.
  • Figure 3: In-Context Learning results on CORD with random and nearest-neighbor retrieval methods for LMDX (PaLM 2-S) and LMDX (Gemini Pro).
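
The idea in the Figure 1 caption, coordinate tokens that double as segment identifiers plus a decoding step that discards ungrounded values, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the text above: the `Segment` type, the `XX|YY` token format, the 100-bucket quantization of the segment center, and the function names `coord_token`, `encode_document`, and `decode_completion`. The sketch handles only flat (leaf) entities and assumes no two segments share a token.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One OCR text segment with a box normalized to [0, 1] (hypothetical type)."""
    text: str
    box: tuple  # (x_min, y_min, x_max, y_max)


def coord_token(seg, buckets=100):
    """Quantize the segment center into an assumed 'XX|YY' coordinate token."""
    x_min, y_min, x_max, y_max = seg.box
    cx = min(int((x_min + x_max) / 2 * buckets), buckets - 1)
    cy = min(int((y_min + y_max) / 2 * buckets), buckets - 1)
    return f"{cx:02d}|{cy:02d}"


def encode_document(segments):
    """Render each segment as '<text> <token>', one segment per prompt line."""
    return "\n".join(f"{s.text} {coord_token(s)}" for s in segments)


# An entity value is expected to end with the token of its source segment.
VALUE_RE = re.compile(r"^(?P<text>.*) (?P<token>\d{2}\|\d{2})$")


def decode_completion(completion, segments):
    """Keep only entities whose text is found in the segment they point to."""
    # Assumption: tokens are unique per segment; collisions are not handled.
    by_token = {coord_token(s): s for s in segments}
    grounded = {}
    for name, value in json.loads(completion).items():
        if not isinstance(value, str):
            continue  # nested (hierarchical) entities are out of scope here
        match = VALUE_RE.match(value)
        if match is None:
            continue  # no segment identifier, so the value cannot be grounded
        seg = by_token.get(match["token"])
        if seg is None or match["text"] not in seg.text:
            continue  # unknown identifier or text absent: treat as hallucinated
        grounded[name] = (match["text"], seg.box)
    return grounded


segments = [
    Segment("Invoice No: 12345", (0.10, 0.05, 0.41, 0.08)),
    Segment("Total: $99.50", (0.60, 0.90, 0.91, 0.95)),
]
completion = json.dumps({
    "invoice_id": "12345 25|06",
    "total": "$99.50 75|92",
    "vendor": "Acme Corp 10|10",  # points at no real segment
})
print(encode_document(segments))
print(decode_completion(completion, segments))
```

In this sketch the bounding box returned for an entity is simply the box of the whole matched segment; a finer-grained localization would need word-level boxes, which are not modeled here.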