Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use
Franz Louis Cesista, Rui Aguiar, Jason Kim, Paolo Acilo
TL;DR
This paper reframes Business Document Information Extraction (BDIE) as a Tool Use problem and introduces Retrieval Augmented Structured Generation (RASG), a four-component framework that combines Retrieval Augmented Generation, Supervised Finetuning, Structured Generation, and Structured Prompting to produce parseable structured outputs for downstream tools. It also introduces the General Line Items Recognition Metric (GLIRM), an evaluation scheme for line-item extraction that accounts for subtask isolation, cell-level accuracy, and permutation invariances, along with a bounding-box backcalculation heuristic that derives spatial bounds without explicit vision encoders. Through ablation experiments on the DocILE dataset, the study shows that Large Language Models with RASG are competitive with the state of the art on KIE and LIR tasks and sometimes surpass strong multimodal baselines, especially when leveraging retrieval and prompting strategies. The findings suggest practical BDIE guidance: start with off-the-shelf LLMs capable of structured generation, add retrieval, and consider fine-tuning for robust performance, with LIR benefiting most from prompt engineering and structured outputs. Overall, the work offers a scalable, tool-oriented approach to BDIE for integrating unstructured documents with downstream systems.
Abstract
Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state-of-the-art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks. The contributions of this paper are threefold: (1) We show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with, or surpass, current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks. (2) We propose a new metric class for Line Items Recognition, the General Line Items Recognition Metric (GLIRM), that is more closely aligned with practical BDIE use cases than existing metrics such as ANLS*, DocILE, and GriTS. (3) We provide a heuristic algorithm for backcalculating bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we claim that, while LMMs might sometimes offer marginal performance benefits, LLMs with RASG are often the better choice given the real-world applications and constraints of BDIE.
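To make the Tool Use framing concrete, the sketch below shows one way a model's output could be treated as a typed call to a downstream system, with document-level fields for KIE and row-level entries for LIR. It is an illustration only, not the paper's implementation: the schema, the field names (vendor_name, invoice_id, line_items), the helper parse_tool_call, and the sample output are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineItem:
    # LIR target: one record per table row (field names are assumptions).
    description: str
    quantity: float
    unit_price: float


@dataclass
class InvoiceToolCall:
    # KIE targets: document-level key information (field names are assumptions).
    vendor_name: str
    invoice_id: str
    line_items: List[LineItem] = field(default_factory=list)


def parse_tool_call(raw: str) -> InvoiceToolCall:
    """Parse raw model output into a typed tool call.

    Raises json.JSONDecodeError on invalid JSON and TypeError when a
    required field is missing or an unexpected field is present.
    """
    data = json.loads(raw)
    items = [LineItem(**item) for item in data.pop("line_items", [])]
    return InvoiceToolCall(line_items=items, **data)


# Hypothetical model output for a two-line invoice.
raw_output = json.dumps({
    "vendor_name": "Acme Corp",
    "invoice_id": "INV-001",
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 9.5},
        {"description": "Gadget", "quantity": 1, "unit_price": 20.0},
    ],
})

call = parse_tool_call(raw_output)
total = sum(item.quantity * item.unit_price for item in call.line_items)
```

Because the output either parses into the schema or raises an error, the downstream system never has to handle free-form text, which is the practical point of modeling extraction as a tool call.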
