UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-like Documents

Kai Hu; Jiawei Wang; Weihong Lin; Zhuoyao Zhong; Lei Sun; Qiang Huo

TL;DR

This work addresses Visual Information Extraction (VIE) from form-like documents, focusing on the hierarchical structures that prior methods handle poorly. It reframes VIE as relation prediction under a unified label space and introduces UniVIE, which combines a coarse-to-fine pipeline with a tree proposal network and a relation decoder to recover hierarchical key-value and choice-group structures. Two novel components, tree level embeddings and a tree attention mask, enhance the decoder's ability to model hierarchical dependencies, and a decoding algorithm yields hierarchical trees via a maximum-spanning-arborescence approach. On HierForms and SIBR, UniVIE achieves state-of-the-art performance, demonstrating the effectiveness of label unification and relation prediction for hierarchical VIE, with potential for zero-shot and few-shot scenarios.
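To make the maximum-spanning-arborescence idea concrete, here is a minimal sketch, not the paper's decoding algorithm. It assumes a hypothetical score matrix `scores[p][c]` for "node `p` is the parent of node `c`" and a known root, and it uses brute-force enumeration, which is only practical for a handful of nodes (an efficient implementation would use something like the Chu-Liu/Edmonds algorithm). The example is chosen so that picking each node's best parent independently would create a cycle.

```python
from itertools import product


def reaches_root(parents, root):
    """Return True if every node reaches the root by following parent links (no cycles)."""
    for node in parents:
        seen = set()
        while node != root:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node]
    return True


def max_spanning_arborescence(scores, root=0):
    """Brute-force search for the highest-scoring tree rooted at `root`.

    scores[p][c] is the (hypothetical) score of the directed edge p -> c.
    Returns (parents, total), where parents maps each non-root node to its parent.
    """
    n = len(scores)
    children = [c for c in range(n) if c != root]
    best_parents, best_total = None, float("-inf")
    # Try every way of assigning exactly one parent to each non-root node.
    for choice in product(range(n), repeat=len(children)):
        parents = dict(zip(children, choice))
        if any(c == p for c, p in parents.items()):
            continue  # a node cannot be its own parent
        if not reaches_root(parents, root):
            continue  # the assignment contains a cycle
        total = sum(scores[p][c] for c, p in parents.items())
        if total > best_total:
            best_parents, best_total = parents, total
    return best_parents, best_total


# Nodes 1 and 2 each prefer the other as parent (scores 8 and 9),
# so independent greedy selection would produce the cycle 1 <-> 2.
scores = [
    [0, 5, 1, 1],
    [0, 0, 9, 2],
    [0, 8, 0, 7],
    [0, 1, 1, 0],
]
parents, total = max_spanning_arborescence(scores, root=0)
print(parents, total)
```

The tree constraint forces one of the two mutually preferred edges to be dropped; the search keeps the combination with the highest total score.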

Abstract

Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we present a new perspective, reframing VIE as a relation prediction problem and unifying labels of different tasks into a single label space. This unified approach allows for the definition of various relation types and effectively tackles hierarchical relationships in form-like documents. In line with this perspective, we present UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE functions using a coarse-to-fine strategy. It initially generates tree proposals through a tree proposal network, which are subsequently refined into hierarchical trees by a relation decoder module. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the relation decoder: a tree attention mask and a tree level embedding. Extensive experimental evaluations on both our in-house dataset HierForms and a publicly available dataset SIBR, substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach in advancing the field of VIE.
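As a rough illustration of what a unified label space could look like in practice, the sketch below treats every prediction as a typed relation between text-line indices. The label names `intra-key`, `intra-value`, and `inter-kvp` come from the Figure 2 caption; everything else is an assumption made for illustration: the function name `decode_relations`, the `(src, dst, label)` tuple format, the use of union-find to merge `intra-*` links into entities, and the choice of which nodes an `inter-*` link connects.

```python
def decode_relations(num_nodes, relations):
    """Group nodes into entities via intra-* links, then attach inter-* links.

    relations: list of (src, dst, label) tuples over node indices (assumed format).
    Returns (entities, links):
      entities maps a tuple of member nodes to an entity type (None if untyped),
      links is a list of (source_entity, target_entity, label).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge nodes joined by intra-* relations into one entity.
    for src, dst, label in relations:
        if label.startswith("intra-"):
            parent[find(src)] = find(dst)

    # Collect the members of each entity in increasing node order.
    members = {}
    for node in range(num_nodes):
        members.setdefault(find(node), []).append(node)

    # The entity type is the suffix of the intra-* label, e.g. "key".
    types = {}
    for src, dst, label in relations:
        if label.startswith("intra-"):
            types[find(src)] = label[len("intra-"):]

    entities = {tuple(members[r]): types.get(r) for r in members}
    links = [
        (tuple(members[find(s)]), tuple(members[find(d)]), label)
        for s, d, label in relations
        if label.startswith("inter-")
    ]
    return entities, links


# Four text lines: 0-1 form a key, 2-3 form a value, and the key links to the value.
relations = [(0, 1, "intra-key"), (2, 3, "intra-value"), (0, 2, "inter-kvp")]
entities, links = decode_relations(4, relations)
print(entities)
print(links)
```

With a single relation list, named entities, key-value pairs, and choice groups differ only in the labels used, which is the point of unifying the label space.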

Paper Structure (29 sections, 1 equation, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Visual Information Extraction
Key Information Extraction.
Key-Value Pair Extraction.
Choice Group Extraction.
Table Filling Strategy
Problem Definition
Methodology
Composable Multimodal Backbone
Tree Proposal Network
Relation Proposal Prediction Head.
Relation Classification Head.
Relation Decoding Algorithm.
Relation Decoder
...and 14 more sections

Figures (4)

Figure 1: An illustrative example of Visual Information Extraction using our proposed UniVIE model. (Orange rectangles represent text-lines, green rectangles represent choice widgets, and blue rectangles represent text widgets. Best viewed in color.)
Figure 2: An example of our unified label space for Visual Information Extraction: (a) named entities; (b) key-value pairs; (c) choice groups. (Yellow arrow: intra-company-name; red arrow: intra-address; sky blue arrow: intra-key; orange arrow: intra-value; pink arrow: inter-kvp; blue arrow: intra-cgt; purple arrow: intra-cf; green arrow: inter-cg; blue rectangle: hierarchical key-value pair and choice group. Best viewed in color.)
Figure 3: Overview of UniVIE for Visual Information Extraction.
Figure 4: A schematic view of the proposed Relation Decoder module.
