Cost-effective End-to-end Information Extraction for Semi-structured Document Images

Wonseok Hwang, Hyunji Lee, Jinyeong Yim, Geewook Kim, Minjoon Seo

TL;DR

This work tackles the high complexity and maintenance cost of pipeline-based IE for semi-structured document images by proposing Wyvern, an end-to-end sequence-generation model with a tree-generating transition module. Wyvern combines a 2D Transformer encoder, a Transformer decoder with gated copying, and an AST-based transition mechanism to produce parses directly from 2D text inputs. The approach achieves results competitive with the traditional POT pipeline, gains substantially when trained with large sets of weak-label parses, and offers a cost-effective alternative because it reduces annotation and tooling requirements. A production-oriented cost analysis and A/B tests illustrate the practical benefits and trade-offs of real-world deployment, highlighting the potential of end-to-end IE in industry settings.

Abstract

A real-world information extraction (IE) system for semi-structured document images often involves a long pipeline of multiple modules, whose complexity dramatically increases its development and maintenance cost. One can instead consider an end-to-end model that directly maps the input to the target output and simplifies the entire process. However, such a generation approach is known to lead to unstable performance if not designed carefully. Here we present our recent effort on transitioning from our existing pipeline-based IE system to an end-to-end system, focusing on the practical challenges associated with replacing and deploying the system in real, large-scale production. By carefully formulating document IE as a sequence generation task, we show that a single end-to-end IE system can be built and still achieve competent performance.

Paper Structure

This paper contains 33 sections, 1 equation, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Schematic of (a) our tagging-based IE system and (b) the end-to-end IE system proposed in this study.
  • Figure 2: Tree representation of a document. (a) An example of an abstract syntax tree (AST). (b) An example of a tree edit distance (TED) calculation.
  • Figure 3: Precision-recall curves for three IE tasks: Japanese name card (nj), Korean receipt (rk), and Japanese receipt (rj). The recall rate is controlled by discarding documents for which Wyvern shows low confidence. The confidence score is calculated empirically by averaging the token generation probabilities.
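
The confidence-based filtering described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `confidence` and `keep_confident`, the document ids, the probability values, and the threshold of 0.8 are all assumptions made for the example. The only detail taken from the caption is that a document's score is the average of its token generation probabilities and that low-scoring documents are discarded.

```python
from typing import Dict, List, Sequence


def confidence(token_probs: Sequence[float]) -> float:
    """Average the per-token generation probabilities of one output.

    Returns 0.0 for an empty output (an assumption; the caption does not
    say how empty generations are handled).
    """
    if not token_probs:
        return 0.0
    return sum(token_probs) / len(token_probs)


def keep_confident(docs: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return ids of documents whose confidence is at least `threshold`.

    Raising the threshold discards more documents, which lowers recall
    and is expected to raise precision on the documents that remain.
    """
    return [doc_id for doc_id, probs in docs.items()
            if confidence(probs) >= threshold]


# Hypothetical per-token probabilities for three generated parses.
docs = {
    "doc_a": [0.99, 0.97, 0.98],
    "doc_b": [0.90, 0.40, 0.65],
    "doc_c": [0.80, 0.85, 0.90],
}
kept = keep_confident(docs, threshold=0.8)
```

Sweeping `threshold` over a range of values and measuring precision and recall at each setting would trace out a curve of the kind the caption describes.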