Spatial Dual-Modality Graph Reasoning for Key Information Extraction

Hongbin Sun, Zhanghui Kuang, Xiaoyu Yue, Chenhao Lin, Wayne Zhang

TL;DR

This work tackles robust key information extraction from diverse, unseen document templates under OCR noise. It introduces SDMG-R, a Spatial Dual-Modality Graph Reasoning framework that jointly reasons over visual and textual features of text regions connected by 2D spatial relations, with a dynamic graph attention mechanism and a compact fusion strategy. A new WildReceipt dataset with 25 categories and 69k text boxes is released to benchmark unseen-template KIE in the wild, and SDMG-R achieves state-of-the-art results on both WildReceipt and SROIE, including under OCR-related perturbations. The approach demonstrates the importance of integrating dual modalities and 2D layout context for robust KIE, and the authors provide code and data to support ongoing research.

Abstract

Key information extraction from document images is of paramount importance in office automation. Conventional template-matching-based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors. In this paper, we propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We model document images as dual-modality graphs, the nodes of which encode both the visual and textual features of detected text regions, and the edges of which represent the spatial relations between neighboring text regions. Key information extraction is solved by iteratively propagating messages along graph edges and reasoning about the categories of graph nodes. To evaluate the proposed method comprehensively and to foster future research, we release a new dataset named WildReceipt, which is collected and annotated specifically for evaluating key information extraction from document images of unseen templates in the wild. It contains 25 key information categories and a total of about 69,000 text boxes, and is about 2 times larger than the existing public datasets. Extensive experiments validate that visual features, textual features, and spatial relations all benefit key information extraction. SDMG-R is shown to extract key information effectively from document images of unseen templates, and it obtains new state-of-the-art results on the recent popular benchmark SROIE and on our WildReceipt. Our code and dataset will be publicly released.
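
To make the graph formulation in the abstract concrete, here is a minimal Python sketch of how detected text regions might be turned into a graph whose edges carry spatial relations. It is an illustration under stated assumptions, not the authors' implementation: the `TextRegion` class, the five-component `spatial_relation` feature, and the fully connected edge set are all hypothetical choices, and the visual and textual node features are omitted.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class TextRegion:
    """One detected text box: its transcription and axis-aligned box (hypothetical)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def spatial_relation(a: TextRegion, b: TextRegion) -> tuple:
    """Assumed edge feature: offset of b relative to a, plus size ratios.

    Every component is divided by a's height so the feature does not
    depend on the absolute scale of the image.
    """
    return (
        (b.x - a.x) / a.h,  # horizontal offset
        (b.y - a.y) / a.h,  # vertical offset
        a.w / a.h,          # aspect ratio of a
        b.h / a.h,          # relative height of b
        b.w / a.h,          # relative width of b
    )


def build_graph(regions):
    """Return a fully connected directed graph as {(i, j): relation}.

    Connecting every ordered pair is an assumption made for simplicity.
    """
    return {
        (i, j): spatial_relation(regions[i], regions[j])
        for i, j in permutations(range(len(regions)), 2)
    }


regions = [
    TextRegion("TOTAL", x=10, y=100, w=50, h=10),
    TextRegion("12.50", x=80, y=100, w=40, h=10),
    TextRegion("CASH", x=10, y=120, w=40, h=10),
]
edges = build_graph(regions)
```

In this toy receipt, the edge from "TOTAL" to "12.50" has a zero vertical offset and a positive horizontal one, which is the kind of 2D layout cue a graph model can use to link a key to its value.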

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of Named Entity Recognition (NER) and our proposed Spatial Dual-Modality Graph Reasoning (SDMG-R). NER models the relations between two text regions on the same horizontal line, while SDMG-R models the relations between all text regions in the spatial neighborhood. Moreover, NER uses textual features only, while SDMG-R uses both visual features extracted from image regions and textual features extracted from texts.
  • Figure 2: The overall architecture of the proposed SDMG-R model for key information extraction. Given one image, visual features $\{\mathbf{v_i}\}$ are extracted via U-Net and ROI-Pooling, while textual features $\{\mathbf{t_i}\}$ are extracted via one Bi-LSTM. In the Dual Modality Fusion Module, the modality features are fused by a Kronecker product approximated by the block-diagonal tensor decomposition. The fused features are then fed into the Graph Reasoning Module, where the node features are propagated and aggregated, and the edge weights are dynamically learned. The final node features are classified into one of the key information categories in the classification module.
  • Figure 3: Annotations and samples of WildReceipt. The left shows the annotated text bounding boxes (red) with their corresponding key information categories (blue); the right shows one receipt sample with folds and one non-front sample in WildReceipt (best viewed in color).
  • Figure 4: Visualization of the learned dynamic weight $e_{ij}$ between text regions $i$ and $j$. Each row shows one text region (blue rectangle) and its related regions (red rectangles) in one receipt image. The first column shows the learned weights in the first Graph Convolution Layer (GCL) of our graph reasoning module, while the second column shows those in the second GCL. For clarity, we visualize only the edges (red directed curves) incoming to a single node, and only those whose weights (red numbers) are greater than 0.1 (best viewed in color).
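
The dynamically learned edge weights mentioned in the captions of Figures 2 and 4 can be illustrated with a small Python sketch. It assumes, hypothetically, that each incoming edge has a raw score which is normalised with a softmax, that a node aggregates its neighbours' features as a weighted sum, and that the visualisation keeps only weights above 0.1. The captions do not give the actual scoring function, so the scores and features here are made-up inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def incoming_weights(scores, i):
    """Normalise raw scores of edges j -> i into weights that sum to 1.

    `scores[(j, i)]` is a hypothetical unnormalised score for edge j -> i.
    """
    sources = sorted(j for (j, k) in scores if k == i)
    weights = softmax([scores[(j, i)] for j in sources])
    return dict(zip(sources, weights))


def aggregate(node_feats, weights):
    """Weighted sum of neighbour features: one propagation step for one node."""
    dim = len(next(iter(node_feats.values())))
    out = [0.0] * dim
    for j, w in weights.items():
        for d in range(dim):
            out[d] += w * node_feats[j][d]
    return out


def visible_edges(weights, threshold=0.1):
    """Keep only weights above the threshold, mirroring the Figure 4 filter."""
    return {j: w for j, w in weights.items() if w > threshold}


# Made-up scores for the three edges pointing into node 0.
scores = {(1, 0): 2.0, (2, 0): 0.0, (3, 0): -1.0}
node_feats = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}

weights = incoming_weights(scores, 0)
updated = aggregate(node_feats, weights)
shown = visible_edges(weights)
```

With these inputs, the edge from node 3 receives a weight below 0.1 and is dropped from `shown`, while it still contributes (slightly) to `updated`. Thresholding in this sketch affects only what is displayed, not the aggregation itself.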