GraphRevisedIE: Multimodal Information Extraction with Graph-Revised Network

Panfeng Cao; Jian Wu

TL;DR

GraphRevisedIE is a lightweight model that embeds textual, visual, and layout features from visually rich documents (VRD) and uses graph revision and graph convolution to enrich the multimodal embedding with global context.

Abstract

Key information extraction (KIE) from visually rich documents (VRD) is a challenging task in document intelligence for two reasons: the complicated and diverse layouts of VRD make it hard for models to generalize, and methods that exploit the multimodal features in VRD are lacking. In this paper, we propose a lightweight model named GraphRevisedIE that effectively embeds multimodal features such as textual, visual, and layout features from VRD and leverages graph revision and graph convolution to enrich the multimodal embedding with global context. Extensive experiments on multiple real-world datasets show that GraphRevisedIE generalizes to documents of varied layouts and achieves comparable or better performance than previous KIE methods. We also publish a business license dataset that contains both real-life and synthesized documents to facilitate research on document KIE.

Paper Structure (14 sections, 16 equations, 7 figures, 6 tables)

Introduction
Related Works
Model Architecture
Embedding
Graph Module
Decoding
Experiments
Datasets
Experiment Settings
Experiment Results
Baseline
Results
Ablation Study
Conclusion

Figures (7)

Figure 1: Example VRD with different layouts. (a) Key entities to be extracted are marked with red rectangles. (b) The same text "03" is semantically ambiguous across different entities. (c) Example business license.
Figure 2: Overall diagram of the GraphRevisedIE framework. Note that for illustration purposes, we use the same color for all tokens in the same segment and different colors for tokens in different segments. The top section of the diagram demonstrates the process of multimodal feature fusion. The bottom right section explains the graph module for feature embedding enrichment. Self-connected edges are omitted. The bottom left section is the BiLSTM-CRF module that calculates the CRF loss and produces the final prediction.
Figure 3: Illustration of generating the image embedding. Inputs are the raw image and bounding boxes of segments. RoI-Align is used to extract segment level features from the whole image feature produced by the CNN module. A convolution kernel is applied to transform the output dimension of RoI-Align to the model dimension.
Figure 4: Process of generating the relative positional embedding. Relative positions are first embedded with the sinusoidal embedding function $f$ and then passed through a linear projection layer to produce the final embedding.
Figure 5: The graph module illustrated on an example SROIE receipt. In the bottom right, segments corresponding to the nodes are given with the indexes and labels (o: other, c: company, a: address, d: date). We use an identity matrix as the initial graph. For simplicity, self-connected edges are omitted. A new segment embedding is produced by graph convolution on the revised graph using the original segment embedding.
...and 2 more figures
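
The captions of Figures 4 and 5 describe two mechanisms that are easy to illustrate in isolation: a sinusoidal encoding followed by a linear projection, and a graph convolution over segment embeddings that starts from an identity graph. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes the common Transformer-style sin/cos formula for $f$, a row-normalized adjacency matrix with a ReLU activation for the convolution, and a hand-picked "revised" graph. All function names, the weights, and the toy features are hypothetical, and the sketch does not show how the model actually learns the revised graph.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinusoidal_embedding(rel_pos: float, dim: int) -> List[float]:
    """Encode one scalar relative position with interleaved sin/cos terms.

    Assumption: the Transformer-style formula; the exact f may differ.
    """
    emb: List[float] = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000.0 ** (i / dim))
        emb.extend([math.sin(rel_pos * freq), math.cos(rel_pos * freq)])
    return emb[:dim]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def row_normalize(adj: Matrix) -> Matrix:
    """Scale each row to sum to 1 so a node averages over its neighbours.

    Assumes every node keeps its self-connection, so no row sums to zero.
    """
    return [[v / sum(row) for v in row] for row in adj]


def graph_convolution(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One layer: ReLU(row_normalize(adj) @ feats @ weight)."""
    hidden = matmul(matmul(row_normalize(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Figure 4 analogue: sinusoidal encoding followed by a linear projection.
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # hypothetical 4x2 weights
rel_emb = matmul([sinusoidal_embedding(0.0, 4)], proj)[0]

# Figure 5 analogue: three segment embeddings and an identity initial graph.
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hand-picked stand-in for a revised graph: adds an edge between nodes 0 and 1.
revised = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

before = graph_convolution(identity, feats, weight)  # each node sees only itself
after = graph_convolution(revised, feats, weight)    # nodes 0 and 1 share context
```

With the identity graph, each segment embedding passes through unchanged in this toy setup, since no information flows between nodes. Once the edge between nodes 0 and 1 is added, both nodes receive the average of their two embeddings, which is the sense in which a revised graph lets a segment embedding absorb context from other segments.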
