Cost-effective End-to-end Information Extraction for Semi-structured Document Images
Wonseok Hwang, Hyunji Lee, Jinyeong Yim, Geewook Kim, Minjoon Seo
TL;DR
This work tackles the high complexity and maintenance cost of pipeline-based information extraction (IE) for semi-structured document images by proposing Wyvern, an end-to-end sequence-generation model with a tree-generating transition module. Wyvern combines a 2D Transformer encoder, a Transformer decoder with gated copying, and an AST-based transition mechanism to produce parses directly from 2D text inputs. The approach achieves results competitive with the traditional POT pipeline, improves substantially when it leverages large sets of weak-label parses, and offers a cost-effective alternative because it requires less annotation and tooling. A production-oriented cost analysis and A/B tests illustrate the practical benefits and trade-offs of real-world deployment, highlighting the potential of end-to-end IE in industry settings.
Abstract
A real-world information extraction (IE) system for semi-structured document images often involves a long pipeline of multiple modules, whose complexity dramatically increases its development and maintenance cost. One can instead consider an end-to-end model that directly maps the input to the target output and simplifies the entire process. However, such a generation approach is known to lead to unstable performance if not designed carefully. Here we present our recent effort on transitioning from our existing pipeline-based IE system to an end-to-end system, focusing on the practical challenges associated with replacing and deploying the system in real, large-scale production. By carefully formulating document IE as a sequence generation task, we show that a single end-to-end IE system can be built and still achieve competent performance.
