VRDSynth: Synthesizing Programs for Multilingual Visually Rich Document Information Extraction

Thanh-Dat Nguyen, Tung Do-Viet, Hung Nguyen-Duy, Tuan-Hai Luu, Hung Le, Bach Le, Patanamon Thongtanunam

TL;DR

This work tackles information extraction from multilingual visually rich documents under data scarcity by integrating Graph Neural Networks with program synthesis. It introduces Bridgar, a system that converts noisy GNN predictions into a domain-specific language to guide a probabilistic program synthesizer, and employs the EC^2 framework to learn reusable post-processing rules. The approach combines a multi-layer GCN, GNN Explainer-derived rules, a DSL-based transformation, and rule synthesis to refine extractions, aiming for better generalization and explainability. Empirical results demonstrate improved performance over baselines on VRD benchmarks and a notable reduction in memory footprint, underscoring the practical impact of combining deep learning with symbolic synthesis for VRD IE.

Abstract

Businesses need to query visually rich documents (VRDs) like receipts, medical records, and insurance forms to make decisions. Existing techniques for extracting entities from VRDs struggle with new layouts or require extensive pre-training data. We introduce VRDSynth, a program synthesis method to automatically extract entity relations from multilingual VRDs without pre-training data. To handle the complexity of the VRD domain, we design a domain-specific language (DSL) that captures spatial and textual relations and is used to describe the synthesized programs. We also derive a new synthesis algorithm that uses frequent spatial relations, search space pruning, and a combination of positive, negative, and exclusive programs to improve coverage. We evaluate VRDSynth on the FUNSD and XFUND benchmarks for semantic entity linking, consisting of 1,592 forms in 8 languages. VRDSynth outperforms state-of-the-art pre-trained models (LayoutXLM, InfoXLMBase, and XLMRobertaBase) in 5, 6, and 7 out of 8 languages, respectively, improving the F1 score by 42% over LayoutXLM in English. To test the extensibility of the model, we further improve VRDSynth with automated table recognition, creating VRDSynth(Table), and compare it with extended versions of the pre-trained models, InfoXLM(Large) and XLMRoberta(Large). VRDSynth(Table) outperforms these baselines in 4 out of 8 languages and in average F1 score. VRDSynth also significantly reduces memory footprint (1M and 380MB vs. 1.48GB and 3GB for LayoutXLM) while maintaining similar time efficiency.
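To make the idea of programs built from spatial and textual relations more concrete, here is a minimal Python sketch. It is not the VRDSynth DSL or its synthesis algorithm: the `Entity` type, the `right_of` relation, the two example programs, and the rule that a negative program vetoes a link proposed by a positive program are all assumptions made for illustration. Exclusive programs are left out because the abstract does not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity: a text span with a semantic label and a position."""
    text: str
    label: str  # e.g. "question" or "answer"
    x: float    # left edge of the bounding box
    y: float    # top edge of the bounding box


def right_of(a, b, max_dy=5.0):
    """Assumed spatial relation: b lies to the right of a on roughly the same row."""
    return b.x > a.x and abs(b.y - a.y) <= max_dy


def positive_program(a, b):
    """Assumed positive program: textual and spatial conditions that propose a link."""
    return (
        a.label == "question"
        and b.label == "answer"
        and a.text.endswith(":")
        and right_of(a, b)
    )


def negative_program(a, b):
    """Assumed negative program: veto links whose endpoints are far apart horizontally."""
    return b.x - a.x > 200


def extract_links(entities, positives, negatives):
    """Keep a pair if some positive program matches and no negative program does."""
    links = set()
    for a in entities:
        for b in entities:
            if a is b:
                continue
            proposed = any(p(a, b) for p in positives)
            vetoed = any(n(a, b) for n in negatives)
            if proposed and not vetoed:
                links.add((a.text, b.text))
    return links


entities = [
    Entity("Name:", "question", 0, 0),
    Entity("Alice", "answer", 60, 0),
    Entity("Stamp", "answer", 500, 2),
]
links = extract_links(entities, [positive_program], [negative_program])
```

In this toy form, `links` contains only the pair `("Name:", "Alice")`: the distant `"Stamp"` entity satisfies the positive program but is removed by the negative one. Because each program is a readable predicate, it is easy to see why a given link was or was not produced.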

Paper Structure (12 sections, 9 figures, 2 algorithms)

This paper contains 12 sections, 9 figures, 2 algorithms.

Figures (9)

  • Figure 1: Examples of visually rich documents and entities to extract
  • Figure 2: Using a Graph Neural Network to identify the desired fields; each text line is represented as a graph node in a node-classification problem
  • Figure 3: The modified GNN Explainer optimizes masks on the adjacency tensor: the size loss minimizes the mask size, the entropy loss pushes mask values toward 0 or 1, and the consistency loss regularizes the optimization to keep the prediction unchanged.
  • Figure 4: Using Program Synthesis for label propagation from keys to values. First, we build relations between nodes from the input text lines' coordinates and extract features via positional and textual encoding. Next, a Graph Neural Network is used for node classification. Finally, to reinforce correct extraction in specific cases, program synthesis is used to generate post-processing rules.
  • Figure 5: The heuristic graphs built by considering the alignment between text lines; the red arrows represent left-right relations, while the black arrows represent top-down relations
  • ...and 4 more figures
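
The alignment-based graph described in the Figure 5 caption can be sketched as follows. This is only an illustration under stated assumptions: the `Box` type, the interval-overlap test for alignment, and the choice to link each text line to its nearest aligned neighbor on the right and below are hypothetical and are not taken from the paper. Coordinates follow the image convention, with y increasing downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text line with an axis-aligned bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlap(a0, a1, b0, b1):
    """Length of the overlap between the intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def build_edges(boxes):
    """Return (i, j, relation) edges, where relation is 'left-right' or 'top-down'."""
    edges = []
    for i, a in enumerate(boxes):
        # Candidates to the right that share some vertical extent with a.
        right = [
            j for j, b in enumerate(boxes)
            if j != i and b.x0 >= a.x1 and overlap(a.y0, a.y1, b.y0, b.y1) > 0
        ]
        # Candidates below that share some horizontal extent with a.
        below = [
            j for j, b in enumerate(boxes)
            if j != i and b.y0 >= a.y1 and overlap(a.x0, a.x1, b.x0, b.x1) > 0
        ]
        if right:
            nearest = min(right, key=lambda j: boxes[j].x0)
            edges.append((i, nearest, "left-right"))
        if below:
            nearest = min(below, key=lambda j: boxes[j].y0)
            edges.append((i, nearest, "top-down"))
    return edges


boxes = [
    Box("Name:", 0, 0, 50, 10),
    Box("Alice", 60, 0, 110, 10),
    Box("Date:", 0, 20, 50, 30),
]
edges = build_edges(boxes)
```

For these three text lines, `edges` holds a left-right edge from `"Name:"` to `"Alice"` and a top-down edge from `"Name:"` to `"Date:"`. Linking only the nearest aligned neighbor keeps the graph sparse; a real system might use tolerance thresholds instead of a strict overlap test.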