Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Zilong Wang, Xiaoyu Shen

TL;DR

The paper addresses scalable information extraction from copy-heavy enterprise documents by exploiting structural redundancy. It introduces a modular hybrid OCR-LLM framework with three extraction paradigms (Direct, Replace, Table) plus a multimodal baseline, and a document-aware controller for format-aware routing. Across 25 configurations and four formats, it shows that table-based extraction with structure-preserving OCR achieves perfect or near-perfect F1 with sub-second latency on structured formats, while multimodal methods excel on image inputs but incur orders-of-magnitude higher latency. The key finding is that format-aware hybrid pipelines can deliver speedups of up to about 54× over universal multimodal approaches, offering a practical blueprint for production-scale copy-heavy document processing.

Abstract

Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). With table-based extraction, our adaptive framework achieves F1=1.0 with 0.97 s latency for structured documents, and F1=0.997 with 0.6 s latency for challenging image inputs when integrated with PaddleOCR, keeping processing time below one second in both cases. A 54× speedup over naive multimodal approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.
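
The format-aware routing mentioned in the abstract can be pictured as a small dispatch step that picks an extraction strategy from the input format. The sketch below is illustrative only: the `ROUTES` table, the strategy names, the `route` function, and the multimodal fallback for unknown formats are all assumptions made for this example, not the authors' controller.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> extraction strategy name.
# The abstract reports table-based extraction for structured formats and
# table-based extraction on top of PaddleOCR output for images; the exact
# mapping and the names used here are assumptions for illustration.
ROUTES = {
    ".xlsx": "table",
    ".docx": "table",
    ".pdf": "table",
    ".png": "table_with_paddleocr",
}


def route(filename: str, default: str = "multimodal") -> str:
    """Choose an extraction strategy from the file extension alone.

    Unknown formats fall back to `default` (an assumed choice here:
    the slower but format-agnostic multimodal baseline).
    """
    suffix = Path(filename).suffix.lower()
    return ROUTES.get(suffix, default)


if __name__ == "__main__":
    for name in ["ids_batch.xlsx", "id_card.PNG", "scan.tiff"]:
        print(name, "->", route(name))
```

Keeping the routing decision in a plain lookup table means a new format can be supported by adding one entry, without touching the extraction code itself.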

Paper Structure

This paper contains 13 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Document examples for copy-heavy tasks: (a) insurance claim; (b) government form; (c) financial report.
  • Figure 2: Architecture of the extraction framework showing the two main components: OCR processing, and LLM-based extraction.
  • Figure 3: Extraction methods for copy-heavy tasks: (a) Document Structure Template; (b) Direct extraction; (c) Replace extraction; (d) Table extraction.
  • Figure 4: Performance comparison of extraction methods across document formats. The heatmap displays $F_1$ scores and processing time for 16 extraction methods (rows) tested on four document formats (columns). Empty cells denote unsupported format-method combinations. The multimodal method employs Qwen2.5-VL-7B, while all other methods utilize Qwen2.5-7B.
  • Figure 5: Performance on PNG-based documents: (a) $F_1$ score with standard deviation. (b) Latency comparison across pipelines.
  • ...and 2 more figures