Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
Zilong Wang, Xiaoyu Shen
TL;DR
The paper addresses scalable information extraction from copy-heavy enterprise documents by exploiting structural redundancy. It introduces a modular hybrid OCR-LLM framework with three extraction paradigms (Direct, Replace, Table) plus a multimodal baseline, and a document-aware controller for format-aware routing. Across 25 configurations and four formats, it shows that table-based extraction with structure-preserving OCR achieves perfect or near-perfect F1 with sub-second latency on structured formats, while multimodal methods excel on image inputs but incur orders-of-magnitude higher latency. The key finding is that format-aware hybrid pipelines can deliver speedups of up to about 54× over universal multimodal approaches, offering a practical blueprint for production-scale processing of copy-heavy documents.
Abstract
Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). With table-based extraction, our adaptive framework achieves F1 = 1.0 at 0.97 s latency for structured documents and, when integrated with PaddleOCR, F1 = 0.997 at 0.6 s for challenging image inputs, keeping both under one second. The 54× speedup over naive multimodal approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.
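The format-aware routing idea described above can be illustrated with a minimal Python sketch. This is not the authors' controller: the `Route` type, the `ROUTES` table, the `route_document` function, and the specific format-to-strategy assignments (including which formats need an OCR pass) are hypothetical choices made only to show the dispatch pattern, with a multimodal fallback for formats the table does not cover.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Route:
    """A hypothetical routing decision for one document."""
    paradigm: str    # one of "direct", "replace", "table", "multimodal"
    needs_ocr: bool  # whether an OCR pass runs before the LLM step


# Illustrative routing table (an assumption, not taken from the paper):
# image inputs go through OCR and then table-based extraction, while the
# other three formats are sent to table-based extraction without OCR.
ROUTES = {
    ".png": Route("table", needs_ocr=True),
    ".pdf": Route("table", needs_ocr=False),
    ".docx": Route("table", needs_ocr=False),
    ".xlsx": Route("table", needs_ocr=False),
}

# Unknown formats fall back to the slower, format-agnostic multimodal path.
FALLBACK = Route("multimodal", needs_ocr=False)


def route_document(path: str) -> Route:
    """Pick an extraction strategy from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    return ROUTES.get(suffix, FALLBACK)


if __name__ == "__main__":
    for name in ["id_card.PNG", "roster.xlsx", "scan.tiff"]:
        print(name, route_document(name))
```

Keeping the mapping in a plain dictionary means per-format strategies can be swapped after benchmarking without touching the dispatch logic, which is the kind of flexibility a document-aware controller would need.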
