UniVIE: A Unified Label Space Approach to Visual Information Extraction from Form-like Documents

Kai Hu, Jiawei Wang, Weihong Lin, Zhuoyao Zhong, Lei Sun, Qiang Huo

TL;DR

This work addresses Visual Information Extraction (VIE) from form-like documents, focusing on the hierarchical structures that prior methods handle poorly. It reframes VIE as relation prediction under a unified label space and introduces UniVIE, a coarse-to-fine pipeline in which a tree proposal network generates tree proposals and a relation decoder refines them into hierarchical key-value and choice-group structures. Two novel components, a tree level embedding and a tree attention mask, improve the decoder's ability to model hierarchical dependencies, and a decoding algorithm produces the final hierarchical trees via a maximum-spanning-arborescence approach. On HierForms and SIBR, UniVIE achieves state-of-the-art results, supporting the effectiveness of label unification and relation prediction for hierarchical VIE, with potential for zero-shot and few-shot scenarios.

Abstract

Existing methods for Visual Information Extraction (VIE) from form-like documents typically fragment the process into separate subtasks, such as key information extraction, key-value pair extraction, and choice group extraction. However, these approaches often overlook the hierarchical structure of form documents, including hierarchical key-value pairs and hierarchical choice groups. To address these limitations, we present a new perspective, reframing VIE as a relation prediction problem and unifying labels of different tasks into a single label space. This unified approach allows for the definition of various relation types and effectively tackles hierarchical relationships in form-like documents. In line with this perspective, we present UniVIE, a unified model that addresses the VIE problem comprehensively. UniVIE follows a coarse-to-fine strategy: it first generates tree proposals through a tree proposal network, which are subsequently refined into hierarchical trees by a relation decoder module. To enhance the relation prediction capabilities of UniVIE, we incorporate two novel tree constraints into the relation decoder: a tree attention mask and a tree level embedding. Extensive experimental evaluations on both our in-house dataset, HierForms, and a publicly available dataset, SIBR, substantiate that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our unified approach in advancing the field of VIE.
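
As a rough illustration of what a tree attention mask could look like, here is a minimal sketch. It assumes the mask lets each node attend only to nodes that fall in the same tree proposal. The abstract does not define the mask, so this rule, the function name `tree_attention_mask`, and the `proposal_ids` input are all hypothetical and may differ from the paper's formulation.

```python
from typing import List


def tree_attention_mask(proposal_ids: List[int]) -> List[List[bool]]:
    """Build a boolean attention mask from tree-proposal assignments.

    proposal_ids[i] is the (hypothetical) id of the tree proposal that
    node i belongs to. mask[i][j] is True when node i may attend to node j.
    Assumption: attention is allowed only within the same proposal.
    """
    n = len(proposal_ids)
    return [
        [proposal_ids[i] == proposal_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Toy example: nodes 0-1 belong to proposal 0, nodes 2-4 to proposal 1.
proposal_ids = [0, 0, 1, 1, 1]
mask = tree_attention_mask(proposal_ids)
```

Under this assumption the mask is block-diagonal once nodes are sorted by proposal, so the relation decoder would only compare nodes that the coarse stage already placed in the same tree.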

Paper Structure

This paper contains 29 sections, 1 equation, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An illustrative example of Visual Information Extraction using our proposed UniVIE model. (Orange rectangles represent text-lines, green rectangles represent choice widgets, and blue rectangles represent text widgets. Best viewed in color.)
  • Figure 2: An example of our unified label space for Visual Information Extraction: (a) named entities; (b) key-value pairs; (c) choice groups. (Yellow arrow: intra-company-name; red arrow: intra-address; sky blue arrow: intra-key; orange arrow: intra-value; pink arrow: inter-kvp; blue arrow: intra-cgt; purple arrow: intra-cf; green arrow: inter-cg; blue rectangle: hierarchical key-value pair and choice group. Best viewed in color.)
  • Figure 3: Overview of UniVIE for Visual Information Extraction.
  • Figure 4: A schematic view of the proposed Relation Decoder module.
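
The relation types named in the Figure 2 caption suggest how a unified label space can be read back into structures. The following is a minimal sketch, not the paper's decoding algorithm: it assumes that `intra-*` relations link nodes belonging to the same entity, and that `inter-*` relations link one entity to another (for example, a key to its value). The function name `decode_relations`, the tuple layout, the link directions, and the toy input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (source node, target node, relation type); node ids index text-lines/widgets.
Relation = Tuple[int, int, str]


def decode_relations(
    n_nodes: int, relations: List[Relation]
) -> List[Tuple[str, List[int], List[int]]]:
    """Group nodes via intra-* links, then pair groups via inter-* links."""
    parent = list(range(n_nodes))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 1: merge nodes connected by intra-* relations into entities.
    for src, dst, rel in relations:
        if rel.startswith("intra-"):
            parent[find(src)] = find(dst)

    groups: Dict[int, List[int]] = defaultdict(list)
    for node in range(n_nodes):
        groups[find(node)].append(node)

    # Step 2: each inter-* relation links the entity of src to the entity of dst.
    pairs = []
    for src, dst, rel in relations:
        if rel.startswith("inter-"):
            pairs.append((rel, groups[find(src)], groups[find(dst)]))
    return pairs


# Toy form: nodes 0-1 form a two-line key, nodes 2-3 a two-line value.
relations = [(0, 1, "intra-key"), (2, 3, "intra-value"), (1, 2, "inter-kvp")]
pairs = decode_relations(4, relations)
print(pairs)
```

Because entities, key-value pairs, and choice groups are all expressed as typed edges over the same nodes, a single relation predictor can cover all three subtasks; only the read-out step differs per relation type.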